Abstract



This study investigates the amount of uncertainty added to NAEP estimates by equating 
error under both ideal and less than ideal circumstances. For example, circumstances led 
to a situation in which the 1994 to 1992 reading assessment equating had to be based on 
a set of common items that was both smaller, and more heavily weighted toward multiple 
choice, than anticipated. If performance on the two types of items does not change at the 
same rate over time, such equatings might introduce systematic bias in trends measured 
from equated scores. Data from past administrations are used to guide simulations of 
various (better and worse) equating designs, and error due to equating is estimated 
empirically. 

The design includes a variety of factors that might affect accuracy of equating, with the 
levels of each factor based roughly on operational values in the NAEP 1992 and 1994 
reading and 1992 mathematics assessments. The purpose is to estimate the approximate 
additional uncertainty that might be introduced by equating from one assessment wave to 
the next, and to determine what factors in the equating design contribute most to that 
uncertainty. The specific factors investigated were number of items in the scale, the 
proportion of items in the scale taken by each student, the proportion of items in each 
administration which are common, the proportion of each item “type” in each scale, the 
proportion of each item type among common items used for equating, the scale linking 
strategy (IRT invariance, common item, or multiple group IRT linking), and the change 
in ability from wave 1 to wave 2. 

Common item scale linking performed very well, even under circumstances which were 
far from ideal, including slight to moderate multidimensionality. Mean bias was estimated
to be no more than about 0.01 to 0.02 standard deviations (about 0.5 to 1.0
NAEP scale points). However, in nonideal conditions there were biases in the extreme 
quantiles (5 percent, 10 percent, and 25 percent points) of the ability distribution, even 
with no population shifts. These biases were several times as large as the mean bias and 
could be large enough to create problems in tracking low performance and the means of 
low performing groups over several waves of assessment. When both waves of data can 
be scaled together, multiple group IRT methods provided very accurate scale linking, 
with virtually no bias. 
Introduction



In this study we examined the problem of equating error in NAEP-like assessment designs
with complex samples and conditioning with multiple imputation, under conditions that 
closely resemble those of an operational assessment. The simulations were based on 
characteristics of the 1992 and 1994 NAEP reading and mathematics assessments. 

A major question was how to incorporate “real data” — that is, characteristics of the actual 
assessment — into the simulation design. Several approaches were considered. One 
approach is to use item response strings from real respondents. This has the advantage of 
producing absolutely real data. It would incorporate, for example, the degree to which 
real item responses fail to conform to the item response model used in the analysis. 
However, it has the important disadvantage that the generating parameters (other than
scale length, number of items taken and number of items that are “common”) are not 
under the control of the investigator and cannot be exactly known. 

Another alternative is to use real data to derive reasonable values of person ability and 
item parameters and then simulate item response strings based on those (known) 
parameter values. This has the advantage of complete control over all relevant 
parameters (and knowledge of their values). If the values of these parameters are derived 
from estimates in operational assessments, they should be a good approximation to reality. 
However, real data may fail to conform to our analytic models in ways we do not fully 
understand (e.g., they may not fit the item response or multiple imputation model). This 
procedure has the disadvantage that it cannot capture the consequences of the misfit of 
real data to our analytic models. One might see the latter approach as suggesting a lower 
bound for errors when the rest of the model fits exactly. 

We decided to use this latter approach — to simulate data that fit the item response model 
rather than use item response strings from real people. Specifically, we used the 
distribution of person ability parameters and caseweights obtained in the 1992 and 1994 
NAEP reading and mathematics assessments, item parameters selected from the values for 
items in these same NAEP assessments, and the correlation between background 
variables (used in conditioning) and ability scores observed in the 1992 NAEP 
assessment. 

Population. Two populations of person ability parameters were used in this study: one 
derived from the 1992 NAEP mathematics assessment and the other from the 1994 
NAEP reading assessment. They were derived by taking the abilities from a random 
sample of 4,000 cases each from the 1992 NAEP mathematics and the 1994 NAEP
reading samples for 17-year-olds. The average of each person’s five plausible values 
served as the generating values of the person ability populations. The weights for these 
cases were preserved for the analysis as well. 

Items. For the reading study, we used a sample of the item parameters from the 1992
reading assessment for 17-year-olds to serve as the generating item parameters. For the
mathematics study, we used the item parameters of the 1992 NAEP mathematics
assessment for 17-year-olds. All item parameters were taken from the 1992 NAEP
technical manual.



Design 

The rationale for the design is guided by features of operational NAEP and the 
implementation of the short-term trend studies as part of the main assessment. Seven 
factors defined the conditions we investigated initially: 

Total number of items. Although the overall item pool in NAEP is large, scaling is
carried out within individual scales that have relatively small numbers of items. For 
example, the 1994 NAEP reading scales ranged from 20 to 40 items and the 1992 NAEP 
mathematics scales ranged from 21 to 47 items. Two scale lengths are examined in this 
simulation — a short scale of 24 items and a long scale of 48 items. 

Proportion of items taken by each student. In order to obtain information about a
range of items, each student who takes items on a particular scale takes only a fraction of 
the total NAEP item pool on each scale. For example, in the 1994 reading assessment, a 
student typically took one or two reading blocks corresponding to about one fourth of the 
items on a scale (if only one item block corresponding to a scale was taken) or one half of 
the items on a scale (if two blocks corresponding to the same scale were taken). This 
simulation examined two situations corresponding to every student taking one fourth or 
one half of the total number of items on the scale. 

Proportion of items treated as common in equating. Although the same items are used
for each wave of the short-term trend studies in NAEP, not all of these items are treated 
as common for the purposes of equating one assessment wave to the next. When the 
parameters of an item drift too much from one assessment wave to the next, those items 
are not included among the “common” items used for equating. Such drifting of item 
parameters is more likely to occur for constructed response items where there has been a 
change in the scoring procedures. For example, in the 1992 to 1994 reading short-term 
trend analysis within the main assessment, improvements in the scoring procedures for 
constructed response items led to decisions that only 57 percent to 85 percent of the 
items could be used as common items for the purposes of equating. Like NAEP, the 
simulation reported here used the same items for both waves of the assessment, but 
examined two situations: one where 50 percent of the items were treated as common 
in the equating and the other where 100 percent of the items were treated as common in the 
equating. 

Proportion of Type I items. There are two types of items in NAEP: multiple choice
items and constructed response items. Constructed response items are further subdivided
into short constructed response items (which are scored dichotomously, but with the
guessing parameter set to zero) and extended constructed response items (which are
scored using a partial credit model). Overall, constructed response items made up from 47
percent to 81 percent of the total items in the three reading scales used in the 1994
NAEP reading assessment. In this simulation two types of items are included, which are
labeled Type I and Type II, corresponding to multiple choice and short (dichotomously
scored) constructed response items, respectively. Two scale types were investigated, one
with relatively few (50 percent) Type I items and one with a larger proportion (two-thirds to
three-quarters) of Type I items. The exact proportions were varied somewhat to
accommodate other factors in the design.

The notion that Type I and Type II items measured slightly different ability dimensions
was realized by using the model underlying the Bock, Gibbons, and Muraki (1988) full
information item factor analysis model. Let $\theta_{1i}$ be the ability of the ith person on the first
ability dimension (corresponding to what is measured in common by Type I and Type II
items) and let $\theta_{2i}$ be the ability measured only by Type II items. If Type I and Type II
items correspond to multiple choice and constructed response items, respectively, then $\theta_{1i}$
might correspond to a dimension of general knowledge and $\theta_{2i}$ might correspond to a
production dimension measured only by constructed response items. The operational
ability for person i is $\lambda\theta_{1i} + (1 - \lambda)\theta_{2i}$, where the value of $\lambda$ is determined by the type of
item. In this simulation we used the value $\lambda = 1$ for Type I items and $\lambda = 0.9$ for Type II
items.

Only dichotomously scored (short) constructed response items were examined in this 
simulation for two reasons. The first is conceptual. The vast majority of constructed 
response items are of the short constructed response type. For example, 80 percent of the 
constructed response items in the 1994 NAEP reading assessment for 17-year-olds and 
87 percent of the constructed response items in the 1992 NAEP mathematics assessment 
for 17-year-olds were short constructed response items. Moreover, the extended 
constructed response items actually showed less of a tendency to drift than did the short 
constructed response items in the 1992 to 1994 reading short-term trend analysis. The 
second reason that extended constructed response items were not used in the simulation 
was that it would have required software that was not available to us (a NAEP proprietary 
program combining Bilog and Parscale). 

Proportion of Type I items treated as common for equating. In the 1992 to 1994 NAEP
reading short-term trend analysis within the main assessment, the items that were not 
included as common items used for equating were exclusively constructed response items. 
Consequently, although the scales were composed of between 47 percent and 71 percent
constructed response items, the common items used for equating had a much smaller 
proportion of constructed response items, between 13 percent and 63 percent. It is 
unclear what effect on equating might arise when the items on a scale are predominantly 
of one type (e.g., constructed response) but the items treated as common for the purposes 
of equating are predominantly of another type (e.g., multiple choice). In this simulation, 
the proportion of Type I items used as common items for equating ranged from 16.7 
percent to 50 percent. 
Type of equating and scale linking. Two alternative strategies for equating assessment
waves and linking scales were investigated in the main simulation: one based on strict
IRT invariance (which has been proposed for, but is not used in, operational NAEP) and
the other based on common item linking, which is similar to the strategy used in operational
NAEP. In addition, we investigated a new strategy for equating and linking based
on multiple group IRT (Bock and Zimowski, 1997).

Change in ability from one assessment wave to the next. One of the problems that
contributes to the difficulties in linking scales in NAEP is that the ability distribution is
changing from one assessment wave to the next. For example, the change from 1992 to
1994 in reading for 17-year-olds was about 0.12 standard deviations. In this simulation
we examined changes of 0.0 and 0.15 standard deviations between assessment waves. In
initial trials of this simulation, both ability dimensions ($\theta_1$ and $\theta_2$) were changed by the same
amount. However, because the primary interest was not change alone but differential
change in the two ability dimensions, changes were introduced in the simulation so that
the change in ability was only 90 percent as great on Type II items as on Type I items. Thus
the change introduced was -0.15 on Type I items, but it was only -0.135 on Type II
items.
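
As a worked check of these two values, here is a short derivation. It assumes (as a reading of the design, not a quotation from it) that a shift $\delta = -0.15$ is applied to the first ability dimension only, and that the mixing weight on that dimension is 1 for Type I items and 0.9 for Type II items:

\[
\Delta_{\text{Type I}} = 1 \cdot \delta = -0.15, \qquad
\Delta_{\text{Type II}} = 0.9\,\delta + (1 - 0.9)\cdot 0 = -0.135 .
\]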

The seven two-level factors in the design description yield a total of 128 combinations
in the completely crossed design. Initial investigation suggested that 80 of these cells were
of most interest in that they posed the most substantial challenges to equating and scale
linking. Consequently, our analyses and reporting have concentrated on these 80
combinations of factors. These 80 combinations can be most easily described in terms of 
20 cells defined by the first five design factors, crossed with the final two factors. We will 
refer to the cells defined by the first five factors in the results that follow. 
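
The counts in this paragraph can be checked with a small sketch. The factor names and the level labels such as "half", "majority", "low", and "high" are hypothetical placeholders chosen for illustration. Only the number of factors and the number of levels per factor come from the text.

```python
from itertools import product

# Seven two-level factors. Names and some level labels are illustrative
# placeholders, not identifiers used in the study.
FACTORS = {
    "total_items": (24, 48),
    "proportion_taken": (0.25, 0.50),
    "proportion_common": (0.50, 1.00),
    "type1_share": ("half", "majority"),
    "common_type1_share": ("low", "high"),
    "linking": ("irt_invariance", "common_item"),
    "ability_shift": (0.0, -0.15),
}


def full_crossing(factors):
    """Enumerate every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


all_cells = full_crossing(FACTORS)
n_full = len(all_cells)    # size of the completely crossed design
n_reported = 20 * 2 * 2    # 20 item-layout cells x linking x ability shift
print(n_full, n_reported)
```

The 20 item-layout cells are a selected subset of the 32 combinations of the first five factors, so the sketch only verifies the arithmetic, not which cells were kept.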

The following table describes the item layout that was generated for each replication of 
each cell of the design defined by the first five factors described above. 
Item Layout: First Five Design Factors

          Factor 1      Factor 2   Factor 3       Factor 4   Factor 5
          Scale Total   Items      Total Common   Type I     Common Type I
Cell      Items         Taken      Items          Items      Items

 1        48            12          6              6          1
 2        48            12          6              6          3
 3        48            12          6              7          1
 4        48            12          6              8          3
 5        48            12         12              6          6
 6        48            24         12             12          2
 7        48            24         12             12          6
 8        48            24         12             14          2
 9        48            24         12             16          6
10        48            24         24             12         12
11        24             6          3              3          1
12        24             6          3              3          2
13        24             6          3              4          1
14        24             6          3              4          2
15        24             6          6              3          3
16        24            12          6              6          2
17        24            12          6              6          4
18        24            12          6              8          2
19        24            12          6              8          4
20        24            12         12              6          6



Scale Linking 

In every scale linking there is a calibration step and a scaling step. The calibration step 
involves obtaining item parameter estimates from a computer program. As in NAEP, 
Bilog was used to do this calibration. The scaling step takes the output from the calibration
program and turns it into scaled proficiency scores. This step is always a linear
transformation, so if $T_i$ is the scaled proficiency score and $\theta_i$ is the output from the
computer program,

\[ T_i = A + B\,\theta_i. \]
There are two waves of test administrations: 1 and 2.



There are also potentially three sets of item parameter estimates: those based on 
calibrating data from administration 1, those based on calibrating data from
administration 2, and those based on calibration of the common data across the two 
administrations. Call these sets of item parameters 1, 2, and C respectively. 

Denote the ability score for person i in administration (time) j, estimated from parameter
set k, by $\theta_i(j, k)$, where the ith person in assessment wave 1 is not the same individual as the
ith person in wave 2. Then we have:

$\theta_i(1,1)$: the ability of the ith person in wave 1, estimated using item parameters from
the wave 1 calibration

$\theta_i(1,C)$: the ability of the ith person in wave 1, estimated using item parameters
from the common calibration

$\theta_i(2,C)$: the ability of the ith person in wave 2, estimated using item parameters
from the common calibration.

Define the moments of the $\theta_i(j, k)$ via:

$M(j, k) = E_w[\theta_i(j, k)]$ (the weighted sample mean)

$S(j, k) = \sqrt{\mathrm{VAR}_w[\theta_i(j, k)]}$ (the weighted standard deviation).

Denote the scaled scores corresponding to the above ability estimates in the same way,
with a capital T in place of $\theta$, e.g., $T_i(j, k)$ instead of $\theta_i(j, k)$.

Note that we need to define the scaling parameters A and B to define the linking. We 
start with a scale defined by A and B which are given a priori. Note that A and B are the 
mean and standard deviation of the scale if M(1,1) = 0 and S(1,1) = 1, but there may be
cases when this is not true (as when wave 1 data has been previously linked to an earlier 
scale). 
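
A minimal sketch of the weighted moments and the linear scaling step follows. The values A = 250 and B = 50, the toy abilities, and the case weights are hypothetical. Normalizing the weighted variance by the sum of the weights is an assumption, since the text does not say which normalization was used.

```python
import math


def weighted_moments(theta, weights):
    """Return the weighted mean M and weighted standard deviation S."""
    total = sum(weights)
    mean = sum(w * t for t, w in zip(theta, weights)) / total
    # Assumption: variance is normalized by the sum of the case weights.
    var = sum(w * (t - mean) ** 2 for t, w in zip(theta, weights)) / total
    return mean, math.sqrt(var)


def to_scale(theta, A, B):
    """Apply the linear scaling T = A + B * theta to each ability."""
    return [A + B * t for t in theta]


# Hypothetical abilities, case weights, and scaling constants.
theta = [-1.0, 0.0, 1.0, 2.0]
weights = [1.0, 2.0, 2.0, 1.0]
A, B = 250.0, 50.0

M, S = weighted_moments(theta, weights)
M_T, S_T = weighted_moments(to_scale(theta, A, B), weights)
print(M, S, M_T, S_T)
```

The scaled scores have mean A + B*M and standard deviation B*S, so A and B are the scale mean and standard deviation only when M = 0 and S = 1.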



IRT Invariance Linking 

In IRT invariance linking we link scale 1 to scale 2 as follows: 

\[ T_i(2,1) = A + B\,\theta_i(2,1). \]

Note that this notation implies that all common items in wave 2 are constrained to have 
their wave 1 parameters and the parameters of noncommon items are unconstrained. This 
is perhaps the strictest version of IRT invariance linking in that item drift (among items
taken as common for the purposes of equating) is uncontrolled and is therefore confounded
with change in ability. Other linking possibilities exist. For example, the
method of Stocking and Lord (1983) imposes an item drift model by constraining the
mean and variance of the item parameters, but allowing the particular values of item 
parameters to drift subject to this overall constraint. We evaluated this strict IRT 
invariance model because it represents one extreme which performed surprisingly well in 
the study of NAEP equating performed by Mazzeo and Donoghue (1995). 



Common Item Linking 

In common item linking we link scale 1 to scale 2 as follows:

$T_i(2,C) = A_C + B_C\,\theta_i(2,C)$, where we derive $A_C$ and $B_C$ by making sure that
the mean and variance of $T_i(1,C)$ are equal to those of $T_i(1,1)$.

Note however that there is no reason to believe that M(1,C) = 0, even if M(1,1) = 0, since
the common calibration will change item parameters and therefore the ability scores.
Similarly, there is no reason to believe that S(1,C) = 1 even if S(1,1) = 1.

Since $T_i(1,1) = A + B\,\theta_i(1,1)$ has mean $A + B\,M(1,1)$ and standard deviation $B\,S(1,1)$,
it follows that $\{[\theta_i(1,C) - M(1,C)]/S(1,C)\}\,B\,S(1,1) + A + B\,M(1,1)$ also has mean
$A + B\,M(1,1)$ and standard deviation $B\,S(1,1)$, since the term in braces is just a z-score.
Collecting terms we get:

\[ A_C = A + B\,M(1,1) - [M(1,C)/S(1,C)]\,B\,S(1,1), \qquad B_C = B\,S(1,1)/S(1,C). \]
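
The moment-matching argument above can be sketched numerically. The function names, the toy ability estimates, the case weights, and the constants A = 250 and B = 50 are all hypothetical. The sketch solves for the slope and intercept that give the wave 1 scores from the common calibration the same weighted mean and standard deviation as the original wave 1 scaled scores.

```python
import math


def weighted_moments(theta, weights):
    """Return the weighted mean and weighted standard deviation."""
    total = sum(weights)
    mean = sum(w * t for t, w in zip(theta, weights)) / total
    var = sum(w * (t - mean) ** 2 for t, w in zip(theta, weights)) / total
    return mean, math.sqrt(var)


def common_item_constants(A, B, theta_11, theta_1C, weights):
    """Slope and intercept that match the moments of T(1,C) to those of T(1,1)."""
    M11, S11 = weighted_moments(theta_11, weights)
    M1C, S1C = weighted_moments(theta_1C, weights)
    slope = B * S11 / S1C
    intercept = A + B * M11 - slope * M1C
    return intercept, slope


# Hypothetical wave 1 estimates from the wave 1 and the common calibrations.
weights = [1.0, 2.0, 1.5, 0.5]
theta_11 = [-1.2, -0.1, 0.4, 1.3]
theta_1C = [-1.0, 0.1, 0.5, 1.6]
A, B = 250.0, 50.0

A_C, B_C = common_item_constants(A, B, theta_11, theta_1C, weights)
T_11 = [A + B * t for t in theta_11]
T_1C = [A_C + B_C * t for t in theta_1C]
print(weighted_moments(T_11, weights), weighted_moments(T_1C, weights))
```

The same two constants would then be applied to the wave 2 estimates from the common calibration to place them on the original scale.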



Multiple Group IRT Linking

Multiple group IRT provides an alternative to the two linking strategies outlined above 
(Bock and Zimowski, 1997). Multiple group IRT makes it possible to simultaneously 
scale the items in several populations, using the distribution of ability in one of the 
populations to anchor those of the other populations and providing automatic scale 
linking. When several populations (e.g., several waves of trend data) can be scaled at 
once, this method has theoretical advantages over the other models considered here. 
Item Types

We will assume that the items actually follow a variant of the multidimensional item
response model given in Bock, Gibbons, and Muraki (1988), where there are two ability
dimensions and a three parameter logistic item response model. In this model, the items
function as if there were a single ability which is a linear combination of the two individual
ability dimensions. The coefficients of this linear combination determine the factor
loading of each item on the ability factors. Thus if $\theta_{1i}$ and $\theta_{2i}$ are ability scores on the two
abilities for person i, person i can be treated as if there were a single ability factor and the
ability score for item j was:

\[ \theta_{ij} = \lambda_j\theta_{1i} + (1 - \lambda_j)\theta_{2i}. \]

Thus to simulate Type I and Type II items, we generate two independent abilities such
that the $\theta$ values have the required distribution while letting $\lambda_j$ take on one value for
Type I items and another value for Type II items. We generate the $\theta_{1i}$ and $\theta_{2i}$ values by
assuming that they are uncorrelated, but that both abilities are equally correlated with
the background variables.



Multiple Imputation 

In our simulation study we are able to make certain simplifying assumptions that make 
computations easier. Since we estimate only one scale at a time, our ability scores are 
univariate, not multivariate as in the main NAEP. Similarly, we can treat the background 
variables as a single variable (one optimal composite of all the background variables). 
This section is an attempt to clarify the procedures and the notation we will use: 

$x_i$: the ith person's item response string, with elements $x_i' = (x_{ij})$,

$y_i$: the ith person's background characteristics (all rolled into one variable),

$\theta_i$: the ith person's ability parameter,

$\gamma$: the slope coefficient linking $\theta_i$ and $y_i$,

$\sigma$: the residual standard deviation in the above regression.

The posterior distribution of $\theta_i$ given $x_i$, $y_i$, $\gamma$, and $\sigma$ is given, up to a normalizing constant, by:

\[ p(\theta_i \mid x_i, y_i, \gamma, \sigma) \propto P(x_i \mid \theta_i, y_i, \gamma, \sigma)\,p(\theta_i \mid y_i, \gamma, \sigma). \]
Since the item response model says $x_i$ depends only on $\theta_i$, it follows that:

\[ P(x_i \mid \theta_i, y_i, \gamma, \sigma) = P(x_i \mid \theta_i) = \prod_j p(x_{ij} \mid \theta_i), \]

where $p(x_{ij} \mid \theta_i)$ is the probability of person i's observed response to item j, which the
logistic IRT model gives as a function of $\theta_i$.

The conditioning model says that $\theta_i$ depends on background variable $y_i$ via a linear
regression. We can standardize $y_i$, and $\theta_i$ is already standardized, so:

\[ \theta_i = \gamma y_i + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0, \sigma^2). \]

Therefore $p(\theta_i \mid y_i, \gamma, \sigma) = \phi((\theta_i - \gamma y_i)/\sigma)/\sigma$, where $\phi$ is the standard normal probability
density function. In the univariate case $\gamma = \mathrm{correlation}(y_i, \theta_i)$, and $\sigma^2 = 1 - \gamma^2$, so $\sigma$
determines $\gamma$.

We use NAEP's multiple imputation process (which includes the conditioning on
background variables). The process has three steps:

1. Draw a value of $\gamma$ from the normal approximation of $p(\gamma, \sigma \mid x_i, y_i)$, fixing $\sigma$ at
its mean. We set $\sigma$, so we know its value, and since $\sigma$ determines $\gamma$ entirely,
this step is trivial.

2. Given $\gamma$, $\sigma$, and $y_i$, get the maximum likelihood estimate of the mean and
variance of the posterior distribution of $\theta_i$ given $x_i$, $y_i$, $\gamma$, and $\sigma$.

3. Sample 5 $\theta_i$ values from a normal distribution with that mean and
variance; these are the plausible values.
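
Steps 2 and 3 can be illustrated with a rough sketch. It is not the NAEP procedure. The posterior mean and variance are obtained here by numerical integration over a fixed grid, which is a substitute chosen for simplicity. The item parameters, grid settings, and function names are hypothetical, and a standard three parameter logistic response function is assumed.

```python
import math
import random


def p_correct(theta, a, b, g):
    """Three parameter logistic probability of a correct response."""
    return g + (1.0 - g) / (1.0 + math.exp(-a * (theta - b)))


def posterior_mean_var(responses, items, y, gamma, sigma,
                       n_points=201, half_width=6.0):
    """Posterior mean and variance of theta on an evenly spaced grid."""
    step = 2.0 * half_width / (n_points - 1)
    grid = [-half_width + k * step for k in range(n_points)]
    post = []
    for t in grid:
        # Log of the normal conditioning density, up to a constant.
        log_p = -0.5 * ((t - gamma * y) / sigma) ** 2
        # Add the log likelihood of the observed item responses.
        for x, (a, b, g) in zip(responses, items):
            p = p_correct(t, a, b, g)
            log_p += math.log(p if x == 1 else 1.0 - p)
        post.append(math.exp(log_p))
    total = sum(post)
    mean = sum(t * w for t, w in zip(grid, post)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, post)) / total
    return mean, var


def draw_plausible_values(mean, var, n_draws=5, rng=None):
    """Draw plausible values from a normal with the given mean and variance."""
    rng = rng or random.Random()
    sd = math.sqrt(var)
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


# Hypothetical (a, b, g) item parameters and conditioning values.
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.0), (1.1, 1.0, 0.0)]
gamma = 0.5
sigma = math.sqrt(1.0 - gamma ** 2)

mean_hi, var_hi = posterior_mean_var([1, 1, 1, 1], items, 0.0, gamma, sigma)
mean_lo, var_lo = posterior_mean_var([0, 0, 0, 0], items, 0.0, gamma, sigma)
pvs = draw_plausible_values(mean_hi, var_hi, rng=random.Random(1))
print(mean_hi, mean_lo, pvs)
```

With only a few items per student, the conditioning density carries much of the information, which is why the background variable matters for the plausible values.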



Generating Values for Multiple Imputation

The NAEP technical manual reports the amount of variance in the $\theta_i$'s that the
background variables account for (the $R^2$ values) in the 1992 NAEP analysis. In reading,
the proportion of variance accounted for is 0.40 (based on 39 conditioning variables) in
the long-term trend and about 0.58 for each of the three reading scales (based on 115
principal components from 218 background variables) in the main assessment. In
mathematics the proportion of variance accounted for by the background variables
ranged from 0.20 to 0.31 for the five mathematics scales in the main assessment (based on
138 principal components from 238 background variables). Thus $R^2$ values of 0.25 (for math)
and 0.52 (for reading) were chosen for the (squared) correlation between background
variables and $\theta_i$.
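
For orientation, the chosen values translate into the regression slope and residual standard deviation as follows. This assumes the univariate relations stated earlier, in which the slope equals the correlation and the residual variance is one minus the squared correlation. The rounded figures are computed here and are not reported in the source.

\[
\gamma_{\text{math}} = \sqrt{0.25} = 0.50, \quad \sigma_{\text{math}} = \sqrt{1 - 0.25} \approx 0.866; \qquad
\gamma_{\text{reading}} = \sqrt{0.52} \approx 0.721, \quad \sigma_{\text{reading}} = \sqrt{1 - 0.52} \approx 0.693 .
\]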
Simulation Methods



Overview 

To the degree possible, control of the simulations was automated. We directed the 
sequence of program runs needed to complete the study of test equating by automatically 
generating batch files that called the necessary executables for simulation of data, 
calibration, equating, generation of plausible values, and assessment of the plausible 
values’ distribution. We used the public release of Bilog 3 (as described in Mislevy & 
Bock, 1990) for all scaling except that in the multiple group IRT scaling analyses, where 
we used Bilog-MG (as described by Bock and Zimowski, 1997). The other steps employed
programs written specifically for this project in the C programming language. Within a 
particular cell of the design, we performed the following steps to evaluate IRT invariance 
equating and common item equating: 

1. Generate data for original (wave 1) assessment. 

2. Calibrate the data, using Bilog. 

3. Generate plausible values for original (wave 1) assessment. 

4. Assess the distribution of the plausible values. 

5. Generate wave 2 data, assuming no change in ability. 

6. Calibrate wave 2 data, using IRT invariance equating strategy. 

7. Generate plausible values. 

8. Assess the distribution of the plausible values. 

9. Calibrate wave 2 data, using common item equating strategy. 

10. Generate plausible values. 

11. Assess the distribution of the plausible values.

12. Generate new wave 2 data, assuming a change in ability ($\theta_1$) of -0.15
standard deviations.

13. Calibrate the new wave 2 data, using IRT invariance equating strategy. 

14. Generate plausible values. 

15. Assess the distribution of the plausible values. 
16. Calibrate the new wave 2 data using common item equating strategy.

17. Generate plausible values. 

18. Assess the distribution of the plausible values. 

For multiple group IRT equating we generated new wave 1 and wave 2 data, equated
using Bilog-MG, and generated and analyzed plausible values as before.

In the paragraphs that follow, specific details of implementing each step of the simulation 
are described. 



Data Generation

It is convenient to think of the data generation process as comprising two stages: 
generation of abilities and background data, and generation of item response strings. 

Recall that we actually conceive ability as being two-dimensional; the first dimension 
represents the ability assessed by Type I items, and the second dimension is the additional 
capability required to complete Type II items successfully. Values for the first dimension of
ability (denoted $\theta_1$) were sampled from the plausible values for 17-year-olds in the 1994
NAEP reading assessment or 1992 NAEP math assessment. We sampled 4002 cases, along
with case weights, and rescaled the values so that the weighted mean and variance were 
zero and one, respectively. We sampled 4002 values so that for design cells with six blocks 
of items we could administer each block to 667 putative individuals; for cells with four 
blocks of items, we omitted two of the sampled values and administered each block to 
1000 individuals. The same sample of 4002 values (or the first 4000 cases of that sample) 
was employed in every cell. When the design called for a simulated shift in ability, we 
simply subtracted 0.15 from each value. The distributions were slightly negatively 
skewed, with some suggestion of a possible ceiling effect; this lack of symmetry was more 
pronounced in the reading distribution than in the mathematics distribution. Figure 1 
shows the approximate shape of the distributions, although the histogram does not 
account for case weights. Abilities on the second dimension ($\theta_2$) and values for the background
variable were pseudo-randomly sampled from the standard normal
distribution (see note 1). After sampling, we rescaled each distribution to have exactly zero mean and
unit variance. We then achieved the desired correlational structure by multiplying the
matrix comprising the columns of abilities and the background variable by the Cholesky
decomposition of the target correlation matrix.
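
The Cholesky step can be sketched as follows. This is an illustration under stated assumptions rather than the project's C code. All three columns are simulated here, whereas in the study the first ability column came from NAEP plausible values. The correlation of 0.5 with the background variable and the column order (first ability, second ability, background variable) are hypothetical choices.

```python
import math
import random


def standardize(values):
    """Rescale a column to exactly zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def cholesky(matrix):
    """Lower triangular factor L with L * L^T = matrix.

    math.sqrt raises ValueError if the matrix is not positive definite.
    """
    n = len(matrix)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                L[i][j] = (matrix[i][j] - s) / L[j][j]
    return L


def impose_correlation(columns, target):
    """Mix standardized, roughly uncorrelated columns toward the target."""
    L = cholesky(target)
    n_rows = len(columns[0])
    out = [[0.0] * n_rows for _ in columns]
    for r in range(n_rows):
        z = [col[r] for col in columns]
        for i in range(len(columns)):
            out[i][r] = sum(L[i][k] * z[k] for k in range(i + 1))
    return out


def correlation(u, v):
    """Sample correlation of two columns."""
    u, v = standardize(u), standardize(v)
    return sum(a * b for a, b in zip(u, v)) / len(u)


rng = random.Random(12345)
n = 4000
cols = [standardize([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(3)]
gamma = 0.5  # hypothetical correlation of each ability with the background variable
target = [[1.0, 0.0, gamma],
          [0.0, 1.0, gamma],
          [gamma, gamma, 1.0]]
theta1, theta2, y = impose_correlation(cols, target)
print(correlation(theta1, y), correlation(theta2, y), correlation(theta1, theta2))
```

Because the factor is lower triangular and the first ability is placed first, that column passes through unchanged, which suits a design where the first dimension is taken from real data. The sample correlations match the target only approximately unless the input columns are exactly uncorrelated in the sample.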

1. Here, and throughout the simulation, pseudorandom normal numbers were generated by the polar
method (Knuth, 1981; Algorithm P). Uniform numbers were generated using a custom implementation
of Marsaglia's (1991) portable random number generator.

We employed a modification of Bock's full-information factor analysis model (Bock,
Gibbons & Muraki, 1988) to define the probability of an individual with particular values
of $\theta_{1i}$ and $\theta_{2i}$ passing an item. The modifications involved two aspects. First, we employed
a logistic probability model, rather than the normal ogive approach. Second, we adjusted
the model to accommodate guessing. The resulting probability equation was:

\[ P(x_{ij} = 1 \mid \theta_{1i}, \theta_{2i}) = g_j + \frac{1 - g_j}{1 + \exp[-a_j(\lambda_j\theta_{1i} + (1 - \lambda_j)\theta_{2i} - b_j)]} \]

where $a_j$, $b_j$, and $g_j$ are the slope, threshold, and guessing parameter of the usual three
parameter logistic IRT model, $\lambda_j$ is a mixing coefficient bounded by zero and one, and $x_{ij}$ is
equal to one when person i responds correctly to item j. Given particular values of item
and person parameters, a "correct" response was generated when a uniformly distributed
pseudorandom number was less than the probability derived from the equation;
otherwise, a failure was generated.
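
The response-generation rule described above can be sketched in a few lines. The function names and the example parameter values are hypothetical. The sketch assumes the usual three parameter logistic form with no additional scaling constant, and it uses Python's built-in generator rather than the generators named in the note.

```python
import math
import random


def p_correct(theta1, theta2, a, b, g, lam):
    """Probability of a correct response for a mixed-ability 3PL item."""
    composite = lam * theta1 + (1.0 - lam) * theta2
    return g + (1.0 - g) / (1.0 + math.exp(-a * (composite - b)))


def simulate_response(theta1, theta2, item, rng):
    """Return 1 if a uniform draw falls below the response probability, else 0."""
    a, b, g, lam = item
    return 1 if rng.random() < p_correct(theta1, theta2, a, b, g, lam) else 0


# Hypothetical items as (a, b, g, lam): Type I uses lam = 1, Type II uses lam = 0.9.
type1_item = (1.0, 0.0, 0.2, 1.0)
type2_item = (1.0, 0.8, 0.0, 0.9)

rng = random.Random(2024)
responses = [simulate_response(0.3, -0.4, type1_item, rng) for _ in range(10)]
print(p_correct(0.0, 0.0, *type1_item), responses)
```

For a Type I item the second ability has no effect because its weight is zero, while for a Type II item it contributes one tenth of the composite.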

Item parameters for each cell of the simulation design were selected from the values
reported for the dichotomously scored items in the reading or mathematics assessments
for 17-year-olds in the 1992 NAEP technical report. Type II items were chosen from those
items with guessing parameters fixed at 0.0; Type I items were chosen from among the
others. We made an effort to keep the average threshold parameters in each block of six
items (cells 11-20) or 12 items (cells 1-10) near the overall mean threshold of approximately
-0.5 for reading and 0.0 for mathematics. When cells differed only in the
number of common versus not common Type I items, the same generating item parameters
were used whenever possible. Within a cell, the same item parameters were used to generate
wave one and wave two data.



Calibration and Equating 

The wave one data were calibrated using Bilog 3 for EX3S, with strong priors constraining 
the guessing parameters of Type II items to be near zero (a value of 0.001 was actually
employed to avoid possible numerical difficulties associated with fixing a prior mean that 
fell on the boundary of values allowed under a beta distribution). A special computer 
program automatically generated the Bilog command file. At the completion of Bilog's
item parameter estimation, estimates were preserved in a copy of the item output file; we
then used Bilog's expected a posteriori ability estimation module, rescaling so that the
sample ability estimates had a mean of zero and unit variance. We generated wave two 
data, and another special program generated a new Bilog command file that placed strong 
priors on the common items, fixing them at the values output at the completion of phase 
two in the previous estimation. The program also read the relevant Bilog output file from 
wave one estimation to find the rescaling constants that were employed to achieve standardized
ability estimates, and wrote the new Bilog command file in such a way that
rescaling of the new ability estimates would employ the same constants. The resultant
ability estimates thus represent estimates equated by the IRT invariance strategy. The same
Bilog command file was employed for invariance equating of the wave two data with an
ability shift (on the $\theta_1$ dimension only) of 0.15 standard deviations.




A Study of Equating in NAEP 



We implemented common item equating by a similar mechanism. A special computer 
program generated a Bilog command file that simultaneously scaled all 8000 or 8004 cases 
(including both wave one and wave two data). We treated common items as the same 
regardless of which time they were employed; non-common items were treated as 
distinct, even though they had the same generating values at both times. Thus, in a cell 
with 48 items of which 24 were common, Bilog was instructed to scale 72 items, divided 
among 12 test forms. Once again, we fixed the guessing parameters of Type II items at 
approximately zero. We instructed Bilog to produce ability estimates in the standard 
metric. Then a separate program derived rescaling constants A and B based on the first
4000 or 4002 estimated abilities from the common scaling run. The program wrote a
Bilog command file that fixed all item parameters at their previously estimated values, 
and applied the newly derived scaling constants. The resultant abilities thus represent 
estimates equated by the common item strategy. 
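
The rescaling step can be illustrated with a minimal sketch. It assumes the constants define a linear transformation A·θ + B chosen so that the wave one abilities (the first block of cases in the joint run) have mean zero and unit variance. The function names, the simulated abilities, and the use of the population standard deviation are illustrative assumptions, not details taken from the Bilog runs.

```python
import numpy as np

def rescaling_constants(wave1_theta):
    """Return (A, B) so that A * theta + B has mean 0 and variance 1 in wave one."""
    sd = np.std(wave1_theta)  # population SD (ddof=0); an assumption of this sketch
    return 1.0 / sd, -np.mean(wave1_theta) / sd

def apply_common_scale(theta_all, n_wave1):
    """Derive the constants from the first n_wave1 cases, then rescale every case."""
    A, B = rescaling_constants(theta_all[:n_wave1])
    return A * theta_all + B

# Hypothetical abilities from a joint run: 4000 wave one cases, then 4000 wave two cases.
rng = np.random.default_rng(0)
theta = np.concatenate([rng.normal(0.20, 1.1, 4000), rng.normal(0.05, 1.1, 4000)])
scaled = apply_common_scale(theta, 4000)
wave1, wave2 = scaled[:4000], scaled[4000:]
print(wave1.mean(), wave1.std())    # approximately 0 and 1 by construction
print(wave2.mean() - wave1.mean())  # estimated change on the wave one metric
```

Because the constants come only from the wave one cases, the wave two abilities are expressed on the wave one metric rather than being restandardized.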

We implemented multiple group IRT by jointly calibrating wave one and wave two data
(treated as two groups) using Bilog-MG. The scales were linked by virtue of the
joint estimation of item parameters. This implementation illustrates the potential of
multiple group IRT if it were used to simultaneously scale two or more waves of trend
data. Such a use would be possible when a new trendline was established or an old one
rescaled to improve comparability across years.



Generation and Assessment of Plausible Values

The problem of generating plausible values was considerably simpler in our case than in
the real NAEP analyses, since the multivariate nature of the background variables was
simplified to a univariate relationship. Recall that at the data generation stage, we
produced a background variable, y, which was correlated with θ₁; the value of the
correlation was γ. The background variable was scaled to have mean zero and variance
one, or a mean of -0.15/γ when the ability was shifted. If ability were truly
unidimensional, then the mean and variance of the posterior distribution of a particular
person's θ could be obtained by appropriate manipulations of the integral of θ times its
posterior density, and of θ² times the posterior density of θ. The posterior density is
proportional to:

$$p(\theta_i \mid x_i, y_i, \gamma, \sigma) = \frac{1}{\sigma}\,\phi\!\left(\frac{\theta_i - \gamma y_i}{\sigma}\right) \prod_j p(x_{ij} \mid \theta_i),$$

where x_ij is the jth element of individual i's item response string, σ is the residual standard
deviation in the regression (and is thus wholly determined by γ, since it is equal to the
square root of 1 - γ²), and φ(z) denotes the standard normal probability density function
evaluated at z. We evaluated these integrals numerically for each individual's ability,
following a procedure that involved several steps. First, we identified an appropriate range
for integration (i.e., a range over which the function was numerically nonzero). Next, we
integrated to get the normalizing constant. Finally, we integrated θ and θ² times the
normalized posterior density. The posterior mean was then taken to be the numerical
result of the integral involving θ, and the posterior variance was the second integral
minus the squared posterior mean. We then generated five plausible values for each
individual, by randomly sampling from the normal distribution with the obtained mean
and variance. We calculated the weighted means, variances, skew indexes, and kurtosis
indexes, as well as nine quantiles, for the 4000 (or 4002) replications of each of the five
plausible values. The results, presented in Section A and discussed in the Results section,
are the means of the five instances of each statistic.
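
The numerical steps described above can be sketched as follows. This is only an illustration under stated assumptions: the item response function is taken to be a three-parameter logistic model with hypothetical parameters a, b, and c, a fixed-width grid around the prior mean stands in for the search for a numerically nonzero range, and all function names are invented for the example.

```python
import numpy as np

def trapezoid(f, x):
    """Trapezoidal rule on a grid (written out to avoid NumPy version differences)."""
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x)) / 2.0)

def posterior_moments(x, a, b, c, y, gamma, half_width=6.0, n_grid=2001):
    """Posterior mean and variance of theta for one examinee.

    x: 0/1 responses; a, b, c: hypothetical 3PL item parameters;
    y: background variable; gamma: its correlation with theta.
    """
    sigma = np.sqrt(1.0 - gamma ** 2)
    center = gamma * y
    # Step 1: integration range (assumed wide enough to hold all of the mass).
    theta = np.linspace(center - half_width, center + half_width, n_grid)
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta[:, None] - b)))
    log_like = np.sum(x * np.log(p) + (1 - x) * np.log1p(-p), axis=1)
    log_post = log_like - 0.5 * ((theta - center) / sigma) ** 2
    dens = np.exp(log_post - log_post.max())
    # Step 2: divide by the normalizing constant.
    dens = dens / trapezoid(dens, theta)
    # Step 3: integrate theta and theta squared against the normalized density.
    mean = trapezoid(theta * dens, theta)
    second = trapezoid(theta ** 2 * dens, theta)
    return mean, second - mean ** 2

def draw_plausible_values(mean, var, rng, n=5):
    """Sample n plausible values from a normal with the posterior mean and variance."""
    return rng.normal(mean, np.sqrt(var), size=n)

# Hypothetical examinee: 12 items, easier items answered correctly.
rng = np.random.default_rng(1)
a, b, c = np.ones(12), np.linspace(-1.5, 1.5, 12), np.zeros(12)
x = (b < 0.3).astype(float)
mean, var = posterior_moments(x, a, b, c, y=0.4, gamma=0.7)
pvs = draw_plausible_values(mean, var, rng)
```

With no item responses at all, the sketch returns the conditional prior, that is, a mean of γy and a variance of 1 - γ², which is a convenient check on the integration.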



Results 

The results of this study suggest that the common item equating and scale linking used in 
NAEP perform rather well on the average, even when each student takes only one 
quarter of the items on the scale and the equating is based disproportionately on one type 
of item. The average bias due to equating is the estimated difference in the mean of the 
scaled ability distribution between one assessment wave and the next minus the change 
in the actual means of the distributions of ability parameters for the two waves. While the 
bias was statistically reliable in some cases (it was several times its standard error) it was 
never large in comparison to the real changes that have been observed in NAEP. The 
maximum bias in the scale mean under any of the conditions examined was only about 
0.01 standard deviations in reading and 0.02 standard deviations in mathematics, which 
(given a typical NAEP scale standard deviation of about 40) is about 0.5 to 1.0 scale 
points. Multiple group IRT methods have the potential to produce even smaller biases. 
The results for simulations based on ability distributions and item parameters for reading 
and mathematics are discussed in detail below, followed by those of the simulation of 
multiple group IRT equating for both subject matters. 
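
The definition of equating bias and its conversion to NAEP scale points amount to simple arithmetic, sketched below. The function names and the wave means in the example are hypothetical; only the scale standard deviation of about 40 and the bias magnitudes come from the text.

```python
SCALE_SD = 40.0  # typical NAEP scale standard deviation, as stated in the text

def equating_bias(mean_wave1, mean_wave2, true_change):
    """Estimated change in the scale mean minus the true change (SD units)."""
    return (mean_wave2 - mean_wave1) - true_change

def to_scale_points(bias_sd, scale_sd=SCALE_SD):
    """Convert a bias in standard deviation units to NAEP scale points."""
    return bias_sd * scale_sd

# Hypothetical wave means for a cell with a true change of -0.15 SD.
bias = equating_bias(mean_wave1=0.003, mean_wave2=-0.158, true_change=-0.15)
print(bias, to_scale_points(bias))

# The maxima quoted in the text: 0.01 SD and 0.02 SD.
print(to_scale_points(0.01), to_scale_points(0.02))
```

The products are 0.4 and 0.8 scale points, which the text rounds to about 0.5 and 1.0.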



Reading 

For common item equating, the maximum bias in the scale mean under any of the
conditions examined was only about 0.011 standard deviations, which (given a typical
NAEP scale standard deviation of about 40) is about 0.5 scale points. Table A.1 presents
the mean bias (mean for wave 2 minus mean for wave 1 minus the true change) for 80
conditions selected from the design, namely those under which it is most
difficult to achieve equating and scale linking.

The pattern of bias suggests a few generalizations. Common item equating generally
appears to work best when the proportion of Type I items on the scale is the same as the
proportion of Type I items used as common items for equating. When these proportions
are highly unequal (that is, when the common items used for equating are
disproportionately Type I items but the entire scale is not), equating is poorest.
Population shifts generally, though not always, make equating more difficult. Surprisingly,
these data suggest that scale linking is not necessarily less biased for longer scales or when
more of the items are taken by each student. The largest bias occurred when 24 items
(one half) of a 48 item scale were taken by each student.

While common item equating and scale linking perform remarkably well, it should be
noted that IRT invariance equating and scale linking do not. Table A.1 shows that
when the mean bias is large, the bias using IRT invariance linking can be several times as
great as that of common item linking. The biases found here could be larger than 1.5
NAEP scale points, which is not negligible in absolute terms or in comparison to typical
NAEP sampling standard errors.

Higher scale moments. In addition to comparing the means of the equated scales, the
variances of the wave 1 and wave 2 (linked) scales were also compared. Table A.2
presents the ratio of each wave 2 scale variance to that of the original (wave 1) scale.
While it appears that the scales linked by the common item equating usually had larger 
variances than the original scale, the increase in variance is small. Scale variances for IRT 
invariance linked scales appear to be somewhat closer to the original scale variances. No 
particularly notable patterns in the variance ratios are apparent. 

The third and fourth moments of the original (wave 1) and linked distributions were also
compared. The differences between these statistics for the wave 1 distribution and those
of the linked (wave 2) distributions are given in tables A.3 and A.4. Since the nature of
the population shift from wave 1 to wave 2 is a constant movement, one would expect
these differences to be zero if the linking were perfect. It appears from these statistics that
common item linking performs well, even in situations where it would be expected to
perform least well.
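
The comparison of higher moments can be illustrated with a short sketch. It assumes the skew and kurtosis indexes are the standardized third and fourth central moments and that weights are normalized to sum to one; these definitions, the function name, and the simulated data are assumptions made for the example. Under a pure constant shift, the variance ratio is 1 and the skew and kurtosis differences are 0, which is the benchmark the text describes.

```python
import numpy as np

def weighted_moments(values, weights):
    """Weighted mean, variance, skew index, and kurtosis index."""
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize weights to sum to one
    mean = np.sum(w * v)
    dev = v - mean
    var = np.sum(w * dev ** 2)
    skew = np.sum(w * dev ** 3) / var ** 1.5
    kurt = np.sum(w * dev ** 4) / var ** 2   # not excess kurtosis
    return mean, var, skew, kurt

# Hypothetical perfectly linked case: wave 2 is wave 1 moved by a constant.
rng = np.random.default_rng(2)
wave1 = rng.normal(0.0, 1.0, 4000)
wave2 = wave1 - 0.15
weights = np.ones(4000)

m1 = weighted_moments(wave1, weights)
m2 = weighted_moments(wave2, weights)
variance_ratio = m2[1] / m1[1]
skew_difference = m2[2] - m1[2]
kurtosis_difference = m2[3] - m1[3]
print(variance_ratio, skew_difference, kurtosis_difference)
```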

Comparisons of scale quantiles. Another way the linked (wave 2) distributions were
compared with the original (wave 1) distributions was by comparing the quantiles (the 1
percent, 5 percent, 10 percent, 25 percent, 50 percent, 75 percent, 90 percent, 95
percent, and 99 percent points of the distribution). Figures B.2 through B.20 in Section B
use these quantiles to illustrate the cumulative distribution of the original (wave 1)
distribution and the four linked (wave 2) distributions for the 20 configurations of items
discussed above. In each case there are two groups of ogives, with the curves in each group
virtually indistinguishable from one another. One group includes the original (wave 1)
distribution and the linked distributions with no population change. The other group
corresponds to the two linked distributions with the -0.15 population change (see figures
B.2 through B.21 in Section B).

These figures illustrate that the distributions match reasonably well in many cells. 
However, in some cells, there are differences between the quantiles of the linked 
distributions and what might be expected with perfect equating. These differences are 
often, although not always, larger in the lower quantiles than at the upper part of the 
distribution. Further, the differences occur even when there are no population changes. 
For example, in cells 6, 7, and 8 (where equating is generally poorest) the 5 percent, 10
percent, and 25 percent points in the wave 2 distribution obtained by common item
linking differ from the corresponding quantiles of the wave 1 distribution by about 0.05,
0.04, and 0.03 standard deviations, respectively. These biases are statistically reliable, 
being several times their standard errors. Assuming a typical NAEP standard deviation of 
about 40 points, these biases would suggest that changes at these quantiles could be 
misestimated by as much as 2 NAEP scale points, which would not be negligible. 
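
A quantile comparison of this kind can be sketched as below. The example uses unweighted quantiles and simulated data for simplicity, whereas the study used weighted statistics; the function name and inputs are hypothetical.

```python
import numpy as np

QUANTILE_LEVELS = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]

def quantile_bias(wave1, wave2_linked, true_change=0.0, scale_sd=40.0):
    """Bias at each of the nine quantiles, in SD units and in NAEP scale points."""
    q1 = np.quantile(wave1, QUANTILE_LEVELS)
    q2 = np.quantile(wave2_linked, QUANTILE_LEVELS)
    bias_sd = (q2 - q1) - true_change
    return bias_sd, bias_sd * scale_sd

# Hypothetical perfectly linked case: every quantile moves by the true change.
rng = np.random.default_rng(3)
wave1 = rng.normal(0.0, 1.0, 4000)
wave2 = wave1 - 0.15
bias_sd, bias_points = quantile_bias(wave1, wave2, true_change=-0.15)
print(np.round(bias_sd, 6))

# A quantile bias of 0.05 SD on a scale with SD 40 corresponds to 2 scale points.
print(0.05 * 40.0)
```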

Detailed information on comparisons of scale moments. Tables A.5 and A.6 provide a
more detailed report of the scale means for the 80 conditions previously discussed,
including the standard errors of each mean. Note that the means of the original (wave 1)
distribution are not identically zero. The reason is that, although the distribution of
generating values may have had a mean of 0 and a variance of 1, the ability values
estimated after scaling with a different set of items would no longer have a mean of zero.
Since different cells in the design called for items with somewhat different characteristics,
the means in the wave 1 distribution are slightly different in each cell of the design.

Tables A.7 and A.8 provide detailed information for the scale variances. Note that the
variances of the original (wave 1) distribution of abilities are not all 1. As in the case of
the means, even if the generating distribution of abilities had a variance of one, the ability
values estimated after scaling with a different set of items would no longer have a variance
of one. Since different cells in the design called for items with somewhat different
characteristics, the variance of the wave 1 distribution is slightly different in each cell of
the design.

Tables A.9 and A.10 give the corresponding values for the scale skewness, while tables
A.11 and A.12 provide a summary of the scale kurtosis values. Tables of the quantiles are
not included, but they have been produced and are available on request.

Effects of increasing multidimensionality. The value of λ used for Type II items in the
main simulation (λ = 0.9) was chosen as probably reasonable after some examination of
the literature and discussion with the members of the NAEP Validity Studies Panel. To
see whether a smaller value of λ, corresponding to a higher degree of multidimensionality,
would have a more deleterious effect on equating, one cell of the design was rerun with
λ = 0.7, a value we considered to be too small to be realistic. In this simulation, the
change of -0.15 units in ability for Type I items is accompanied by a change of only -0.105
for Type II items.
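
The figure of -0.105 is consistent with the Type II shift being the Type I shift multiplied by the multidimensionality parameter (written λ here). The sketch below makes that arithmetic explicit; the proportional rule is inferred from the numbers in the text, and the function name is hypothetical.

```python
def type2_shift(type1_shift, lam):
    """Shift on the Type II dimension, assuming it scales with lam."""
    return lam * type1_shift

print(type2_shift(-0.15, 0.7))  # the rerun cell: about -0.105
print(type2_shift(-0.15, 0.9))  # the main simulation under the same rule: about -0.135
```

The second value is not reported in the text; it simply applies the same assumed rule to the main simulation's value of 0.9.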

The results of this simulation, given in table A.13, suggest that even under these
conditions common item linking performed about as well as under the other conditions
studied. The mean bias for common item linking was -0.009 standard deviations when
there was no population change and -0.011 standard deviations when there was a
population change.

The effect was greater for IRT invariance linking than for common item linking, but not
substantially greater than it was for IRT invariance linking with no population change.
The mean bias for invariance linking was -0.003 standard deviations when there was no
population change and -0.039 standard deviations when there was a population change.

The effects of increased multidimensionality on the extreme quantiles of the distribution 
are considerably larger than at the mean, but not substantially larger than for the cases 
with less multidimensionality. Figure B.21 uses the quantiles to illustrate the cumulative 
distribution of the original (wave 1) distribution and the four linked (wave 2) 
distributions discussed in this section. 

Effects of increasing the precision of the simulated values. The results previously
reported are based on 10 replications of each condition. However, since the number of
students is relatively large (4,000 per wave) and the generating ability distribution is 
identical in each replication, the variation across replications is rather small. To 
investigate whether results would change substantially with a larger number of 
replications, 50 replications of cells 3, 5, 6, 7, 8, and 9 (where equating showed the 
poorest performance) were run and compared with the results of the first 10. There was 
no substantial change in results for the first four moments or the quantiles of the 
distributions; in most cases, the quantiles shifted slightly only in the third decimal place. 



Mathematics 

The scale linking biases tended to be somewhat larger for mathematics than the 
corresponding biases for reading. We believe that this may be associated with the lower 
correlation between background variables and the mathematics ability scales. For 
common item equating, the maximum bias in the scale mean under any of the conditions 
examined was only about 0.021 standard deviations which (given a typical NAEP scale 
standard deviation of about 40) is about 1.0 scale points. The maximum bias using strict 
IRT invariance equating was more than twice as large as that for common item equating. 
Table A.14 presents the mean bias (mean for wave 2 minus mean for wave 1 minus the
true change) for the 80 conditions selected from the design which were also examined for
the reading simulation. Because the reading study suggested that IRT invariance equating
was markedly inferior to common item equating, results for IRT invariance are presented
only for the case of no change in population ability (as a check on previous results).

The pattern of bias suggests a few generalizations. Common item equating generally 
appears to work best when the proportion of Type I items on the scale is the same as the 
proportion of Type I items used as common items for equating. When these proportions 
are highly unequal (that is when the common items used for equating are 
disproportionately Type I items but the entire scale is not), then equating is poorest. 
Population shifts generally, though not always, make equating more difficult. These data 
also suggest that scale linking is not necessarily any more or less biased for longer scales or 
when more of the items are taken by each student. The two cells with the largest bias in
common item equating occurred when 12 items (one half) of a 24 item scale were taken
by each student, but the next largest biases occurred when students took 24 items (one
half) of a 48 item scale.

Common item equating and scale linking perform remarkably well, but IRT invariance
equating and scale linking do not, even when there is no change in the population
mean. Table A.14 shows that the mean bias using IRT invariance linking can be several
times as great as that of common item linking. The biases found here could be almost 2.5
NAEP scale points, which is not negligible in absolute terms or in comparison to typical
NAEP sampling standard errors. However, we detected some problems in determining
convergence for IRT invariance linking in cells 16-20, which suggests that the magnitude
of these biases may be somewhat overestimated.

Higher scale moments. In addition to comparing the means of the equated scales, the
variances of the wave 1 and wave 2 (linked) scales were also compared. Table A.15
presents the ratio of each wave 2 scale variance to that of the original (wave 1) scale.
While it appears that the scales linked by the common item equating usually had larger 
variances than the original scale, the increase in variance is small (less than 6 percent). 
Scale variances for IRT invariance linked scales appear to be somewhat closer to the 
original scale variances, except in the case of cells 16-20 (where each person took half 
the items on a short scale). 

The third and fourth moments of the original (wave 1) and linked distributions were also 
compared. The differences between these statistics for the wave 1 distribution and those 
of the linked (wave 2) distributions are given in tables A.16 and A.17. Since the nature
of the population shift from wave 1 to wave 2 is a constant movement, one would expect 
these differences to be zero if the linking were perfect. It appears from these statistics 
that common item linking performs well, even in situations where it would be expected to 
perform least well. 

Comparisons of scale quantiles. Another way the linked (wave 2) distributions were
compared with the original (wave 1) distributions was by comparing the quantiles (the 1
percent, 5 percent, 10 percent, 25 percent, 50 percent, 75 percent, 90 percent, 95
percent, and 99 percent points of the distribution). Figures B.23 through B.42 use these
quantiles to illustrate the cumulative distribution of the original (wave 1) distribution
and the three linked (wave 2) distributions for the 20 configurations of items discussed
above. In each case there are two groups of ogives, with the curves in each group very
similar to one another. One group includes the original (wave 1) distribution and the
linked distributions with no population change. The other group corresponds to the
linked distribution with the -0.15 population change.

These figures illustrate that the distributions match reasonably well in many cells.
However, in some situations (such as those of cells 16, 17, and 18), there are differences
between the quantiles of the linked distributions and what might be expected with perfect
equating. These differences are often, although not always, larger in the lower quantiles
than at the upper part of the distribution. In these situations IRT invariance equating
performed particularly poorly, producing biases larger than 0.1 standard deviation at the 5
percent point. Assuming a typical NAEP standard deviation of about 40 points, these
biases would suggest that changes at these quantiles could be misestimated by more than 5
NAEP scale points. These differences occur even when there are no population changes.

Common item equating usually performed substantially better than IRT invariance
equating; however, some of the biases are still significant at the extremes. For example, in
cells 16, 17, and 18 (where equating is generally poorest) the 5 percent points in the wave
2 distribution obtained by common item linking differ from the corresponding quantiles
of the wave 1 distribution by about 0.05, 0.05, and 0.04 standard deviations, respectively.
These biases are statistically reliable, being several times their standard errors. Assuming a
typical NAEP standard deviation of about 40 points, these biases would suggest that
changes at these quantiles could be misestimated by as much as 3 NAEP scale points,
which would not be negligible.

Detailed information on comparisons of scale moments. Tables A.18 and A.19 provide
a more detailed report of the scale means for the 80 conditions previously discussed,
including the standard errors of each mean. Note that the means of the original (wave 1) 
distribution are not identically zero. The reason is that, although the distribution of 
generating values may have had a mean of 0 and a variance of 1, the ability values 
estimated after scaling with a different set of items would no longer have a mean of zero. 
Since different cells in the design called for items with somewhat different characteristics, 
the means in the wave 1 distribution are slightly different in each cell of the design. 

Tables A.20 and A.21 provide detailed information for the scale variances. Note that the
variances of the original (wave 1) distribution of abilities are not all one. As in the case of
the means, even if the generating distribution of abilities had a variance of 1, the ability
values estimated after scaling with a different set of items would no longer have a variance
of 1. Since different cells in the design called for items with somewhat different
characteristics, the variance of the wave 1 distribution is slightly different in each cell of
the design.

Tables A.22 and A.23 give the corresponding values for the scale skewness, while tables
A.24 and A.25 provide a summary of the scale kurtosis values. Tables of the quantiles are
not included, but they have been produced and are available on request. 



Multiple Group IRT

Multiple group IRT performed extraordinarily well, even better than the common item 
equating procedures we studied. The maximum bias in the scale mean under any of the 
conditions examined was less than 0.01 standard deviations which (given a typical NAEP 
scale standard deviation of about 40) is about 0.5 scale points, or half of that of common 
item equating. In most cases, the bias was so small as to be negligible in both the 
mathematics and the reading simulations. Table A.26 presents the mean bias (mean for
wave 2 minus mean for wave 1 minus the true change) for 80 conditions selected from
the design which are the conditions under which it is the most difficult to achieve
equating and scale linking.

Higher scale moments. In addition to comparing the means of the equated scales, the
variances of the wave 1 and wave 2 (linked) scales were also compared. Table A.27
presents the ratio of each wave 2 scale variance to that of the original (wave 1) scale.

While it appears that the scales linked by the multiple group equating usually had smaller 
variances than the original scale when there was no change and usually had larger 
variances than the original scale when there was a change, the difference in variance is 
small (less than 2 percent). 

The third and fourth moments of the original (wave 1) and linked distributions were also 
compared. The differences between these statistics for the wave 1 distribution and those 
of the linked (wave 2) distributions are given in tables A.28 and A.29. Since the nature
of the population shift from wave 1 to wave 2 is a constant movement, one would expect
these differences to be zero if the linking were perfect. It appears from these statistics
that multiple group linking performs well, even in situations where it would be expected
to perform least well.

Comparisons of scale quantiles. Another way the linked (wave 2) distributions were
compared with the original (wave 1) distributions was by comparing the quantiles (the 1
percent, 5 percent, 10 percent, 25 percent, 50 percent, 75 percent, 90 percent, 95
percent, and 99 percent points of the distribution). The distributions match
extraordinarily well in all cells, and plots of the cumulative distribution of linked
distributions are indistinguishable. The quantiles of the linked distributions generally
differ by no more than might be expected due to sampling
error if there were perfect equating. Unlike the common item equating methods studied,
these differences are no larger in the lower quantiles than at the upper part of the
distribution.



Discussion 

This study suggests that the common item equating and scale linking currently used in 
NAEP perform rather well, even when the number of common items is small, each 
student takes only 25 percent of the items on a scale, the ability scale is slightly 
multidimensional, and there are changes in the ability distribution. The bias in 
estimating mean performance introduced by common item equating appears to be no
more than about 0.01 to 0.02 standard deviations, or one-half to one point on the NAEP
scale. This is small, but not entirely negligible in comparison to the sampling standard
error at the mean for the nation as a whole. Explorations of the effect of somewhat
increased multidimensionality did not produce substantially larger equating bias.

It is important to recall that NAEP may create several subscales for a given subject area
that are averaged to obtain an overall scale for that subject. It is tempting to believe that
the biases in the subscales would cancel out and that the overall scale would be less biased
than the subscales from which it is composed. This need not be the case. Consequently,
the biases in the overall scale may not be less than those of the subscales. Indeed, they
could be larger in comparison to the decreased standard error of the overall scale.

To apply these results to operational NAEP one might examine the equating of the 1992 
to 1994 short-term trend scales in reading. There were three scales: reading for 
information, reading to perform a task, and reading for literary purposes. The information 
scale had 40 items, half of which were constructed response, but only 13 percent of the 
constructed response items were used as common items for equating. Therefore, the 
situation for the information scale most resembles cells 1 or 6. The literary and task scales 
had 20 and 27 items respectively, of which 65 percent and 59 percent were constructed 
response, but only 50 percent and 45 percent of the constructed response items were used 
for equating. The situation for these two scales resembles cells 13 or 18. This analogy 
suggests that the bias should be between 0.001 and 0.011 standard deviations (0.0 to 0.5
NAEP scale points) for the information scale and 0.003 and 0.006 standard deviations 
(0.1 to 0.3 NAEP scale points) for the other two scales. 

It might be advisable to consider equating as introducing as much as 0.5 to 1.0 points of
bias in trend comparisons. Thus a viable procedure might be to test for differences
between assessment waves by testing whether the difference is greater than 1.0 scale units
(the maximum equating bias found here). Alternatively, one might increase the sampling
standard error by a fraction that would accomplish approximately the same result as a way
to characterize the contribution of equating bias to uncertainty. That is, one might treat
equating error as a fixed component in the variance of the difference between assessment
wave means. Assuming approximately equal sample sizes in each assessment wave, this
leads to a standard error for the difference of the form:

$$\sqrt{SE_1^2 + SE_2^2 + \delta^2/2},$$

where δ is the equating bias (e.g., 0.5).

Results for scale quantiles suggest more caution. While there was generally only small bias 
in the scale quantiles due to linking, in some cases the bias in the quantiles was 
substantial, up to 5 times that of the mean. This suggests that scale linking can pose 
problems for inferences about changes in the extremes of the distribution or about groups 
whose scores tend to be extreme. In the cases where linking was poorest, the 5 percent, 10 
percent, and even 25 percent points shifted by an amount equivalent to as much as 2 
NAEP scale points. This may be particularly important if the performance of 
disadvantaged groups, who tend to score substantially below the mean, continues to be an 
important national policy interest.

This simulation did not address the question of multiple linkings over more than two
waves of data collection. While it would be naive to assume that worst-case biases would
simply compound over years, it is not clear exactly how much biases increase after linking
of several waves of assessments. On the other hand, it seems realistic to assume that there
is some compounding, and that the effects on extreme quantiles could be substantial.

Even the effect on means could be nonnegligible after, say, five waves of data collection. 
For example, a bias in one direction of 0.005 standard deviations compounding over five 
waves of assessment could become a total bias of 0.025 standard deviations or about a 
scale point. Recalling that the bias at extreme quantiles could be five times as large, such 
compounding could correspond to a bias of several points at the quantiles. 
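
The arithmetic of this example can be written out as a short sketch. It assumes the simplest case used in the text, a bias of constant size and direction adding up across waves, and is not a model of how biases would actually accumulate; the function name is hypothetical.

```python
def compounded_bias(per_wave_bias_sd, n_waves, scale_sd=40.0):
    """Total bias after n_waves same-direction linkings, in SD units and scale points."""
    total_sd = per_wave_bias_sd * n_waves
    return total_sd, total_sd * scale_sd

total_sd, total_points = compounded_bias(0.005, 5)
print(total_sd, total_points)  # about 0.025 SD, or about 1 scale point

# If the bias at extreme quantiles were five times as large as at the mean:
quantile_sd, quantile_points = compounded_bias(5 * 0.005, 5)
print(quantile_sd, quantile_points)  # about 0.125 SD, or about 5 scale points
```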

Multiple group IRT models have great promise as alternatives to equating and linking 
based on single group IRT methods. When a multiple group model was used in this 
simulation, nearly all of the bias was eliminated, and the linked distribution was virtually 
indistinguishable from what would have been expected if there were perfect equating. 

The multiple group method is useful when two waves of data can be scaled together (for 
example, when an entire trend series is computed at once), and the advantages should be 
even greater when more than two waves of data are linked. On the other hand, multiple 
group equating would not have much advantage over conventional IRT methods if, for 
example, the first wave of the data was scaled separately and multiple group methods 
could be applied only to the second wave of the data, since the equating would not be 
provided internally by the multiple group model. 

Trend reporting in NAEP has not, up until now, involved revisions to previous reports. 
However, it is possible to introduce such revisions. Social statistics of many kinds are
revised from time to time, and even values of fundamental physical constants are subject 
to periodic redetermination that alters their values. The revision of scores that occurs in 
multiple group IRT is a consequence of additional information (the second wave of data) 
which increases the precision of estimates of the scores in the first wave of data. There is a 
revealing parallel in the determination of values of fundamental physical constants. 
Experiments which estimate these constants must rely on data from other experiments 
which measure related constants or the relations among constants. The value of a 
constant may need to be revised when better data on related constants are obtained. We
regard the revision of the first wave of scores in multiple group IRT as logically equivalent to
the revision of the value of a physical constant given new data on a related constant. 
While retrospective redetermination of individual test scores might pose problems, 
individual test scores are not provided by NAEP or other assessment programs that focus 
on population distributions. The merits of less biased measurements may outweigh the 
problems caused by slight adjustments to scores, particularly in long trend lines where 
equating and linking errors are likely to be greatest. 

Finally, we noted that our simulations were sensitive to the background variables used in 
the conditioning process. We believe that most of the differences between mathematics 
and reading that we observed were a consequence of the fact that the correlation between 
the background variables and mathematics ability was lower than the correlation between 
the background variables and reading ability (R² = 0.25 versus R² = 0.52). It is clear that changes 
in the background variables or their relation to achievement can affect the ability 
distribution generated through multiple imputation. Our simulations relied on a 
correlation structure with the background variables that did not change over time. That 
structure may not remain constant if background variables or the process by which they 
are collected are changed. The maintenance of a constant set of background variables for 
conditioning of (short- or long-term) trend data is a consideration that should not be 
overlooked in the operation of NAEP. 
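
To make the role of the background-variable relationship concrete, the sketch below uses a deliberately simplified normal linear conditioning model. This is an assumption made here for illustration, not the operational NAEP conditioning model. Under that assumption, the share of ability variance left unexplained by the background variables is 1 - R², so the two R² values quoted above (0.25 and 0.52) imply different residual spreads. The function name `residual_sd` and the unit-variance ability scale are illustrative choices.

```python
import math


def residual_sd(r_squared: float, ability_sd: float = 1.0) -> float:
    """Residual SD of ability given the background variables.

    Assumes a linear regression of ability on the background variables,
    so the unexplained variance is (1 - R^2) times the total variance.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie in [0, 1]")
    return ability_sd * math.sqrt(1.0 - r_squared)


# R^2 values quoted in the text for the two simulations.
r2_by_subject = {"mathematics": 0.25, "reading": 0.52}

for subject, r2 in r2_by_subject.items():
    print(f"{subject}: R^2 = {r2:.2f}, residual SD = {residual_sd(r2):.3f}")
```

Under this simplification, a lower R² leaves more of the ability variance to be determined by the item responses rather than by the conditioning variables, which is one plausible reason the two simulations could behave differently.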



Recommendations 

This research suggests some practical recommendations for NAEP and other 
large-scale assessments using NAEP-like procedures. 

1. Even in the most difficult conditions usually encountered, the common item 
equating and scale linking procedures currently used in operational NAEP 
appear to introduce relatively little bias (less than one NAEP scale point) in 
comparisons of the means of two waves of data. There should be little bias 
also in comparisons of subgroup means that are relatively near the center of 
the overall populations. The fact that these procedures are also 
straightforward and well understood supports their continuation. 

2. The common item equating and scale linking procedures currently used in 
operational NAEP introduce substantially more bias (up to two NAEP scale 
points) in comparisons of the extreme percentiles of two populations. We 
recommend caution in comparisons of extreme percentiles over time or 
comparisons over time of the means of population subgroups which differ 
substantially from the overall population mean. Such cautions would apply 
also to examination of trends over time in proportions of the population at 
extremely high achievement levels. In these cases, the sampling standard 
errors may substantially understate the true uncertainties of trends. In such 
cases the use of a conservative test that the scale difference is larger than 
some nonzero value (e.g., 2 NAEP scale points) may be warranted as a test of 
the null hypothesis of no trend. 

3. Strict IRT invariance equating and scale linking should not be used in NAEP 
or other large scale assessments. It introduces substantially more bias than the 
procedures currently used in NAEP. 

4. Multiple group IRT methods have considerable scientific merit for equating 
and scale linking. These methods have the potential of practically 
eliminating bias in scale linking, even in the situations where current 
methods are weakest. When all waves of data can be analyzed together, 
multiple group IRT has no apparent disadvantages. When all waves of data 
cannot be scaled together (as in NAEP trend reporting), multiple group IRT 
methods have the disadvantage that the linking of a second (or later) wave of 
data alters scores on the first wave of data. We believe that this is not a fatal 
flaw. Social statistics of many kinds are revised from time to time, and even 
values of fundamental physical constants are subject to periodic 
redetermination that alters their values. The revision of scores that occurs in 
multiple group IRT is a consequence of additional information (the second 
wave of data) which increases the precision of estimates of the scores in the 
first wave of data. While retrospective redetermination of individual test 
scores might pose problems, individual test scores are not provided by NAEP 
or other assessment programs that focus on population distributions. We 
believe that the merits of less biased measurements may outweigh the 
problems caused by slight adjustments to scores. 

5. Although current NAEP procedures appear adequate for comparisons of 
population means across two or three waves of data, they do not ensure that 
equating and linking biases will not compromise long trend lines and 
particularly trends of extreme percentiles. Therefore, the data underlying 
long trend lines should be periodically reanalyzed using methods, such as 
multiple group IRT, which can minimize equating and linking bias. 
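
The conservative test suggested in recommendation 2 can be sketched as follows. This is a minimal illustration, assuming a normal approximation and treating the margin (2 NAEP scale points by default) as an allowance for linking bias; the function names and the example numbers are hypothetical and are not taken from the study.

```python
import math


def normal_sf(z: float) -> float:
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def conservative_trend_test(diff: float, se: float, margin: float = 2.0):
    """Test H0: |true difference| <= margin against H1: |true difference| > margin.

    diff   -- observed wave 2 minus wave 1 difference (NAEP scale points)
    se     -- sampling standard error of diff
    margin -- allowance for linking bias (hypothetical default of 2 points)
    Returns (z, one-sided p-value) under a normal approximation.
    """
    if se <= 0:
        raise ValueError("se must be positive")
    z = (abs(diff) - margin) / se
    return z, normal_sf(z)


# Hypothetical numbers chosen only to show the effect of the margin.
z_margin, p_margin = conservative_trend_test(diff=3.0, se=1.2, margin=2.0)
z_zero, p_zero = conservative_trend_test(diff=3.0, se=1.2, margin=0.0)
print(f"margin 2: z = {z_margin:.2f}, p = {p_margin:.3f}")
print(f"margin 0: z = {z_zero:.2f}, p = {p_zero:.4f}")
```

In this made-up case a 3-point difference looks clearly nonzero when only sampling error is considered, but it is not distinguishable from a difference of 2 points or less once the margin is applied.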
Section A: Tables 

Table A.1 Average Scale Linking Bias (Reading Simulation) 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common  Invariance Common Item  Invariance Common Item
Cell   Items   Items    Type 1    Equating    Equating    Equating    Equating
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1      -0.001      -0.006       0.010       0.002
   2       6       6         3      -0.008      -0.002       0.011       0.002
   3       6       7         1      -0.000      -0.001       0.021       0.003
   4       6       8         3      -0.003      -0.001       0.018       0.006
   5      12       6         6       0.022      -0.004       0.025      -0.007

48 Total Items, 24 Items Taken
   6      12      12         2       0.002      -0.006       0.032       0.011
   7      12      12         6      -0.003      -0.004       0.028       0.010
   8      12      14         2      -0.004      -0.010       0.037       0.007
   9      12      16         6      -0.002      -0.006       0.027       0.008
  10      24      12        12       0.000       0.004       0.009      -0.001

24 Total Items, 6 Items Taken
  11       3       3         1      -0.001       0.004       0.005       0.004
  12       3       3         2      -0.001       0.001      -0.001       0.006
  13       3       4         1       0.001       0.004       0.009       0.005
  14       3       4         2       0.001       0.005       0.005       0.005
  15       6       3         3      -0.002       0.002      -0.007       0.000

24 Total Items, 12 Items Taken
  16       6       6         2       0.000      -0.000       0.014       0.007
  17       6       6         4       0.001       0.004       0.012       0.001
  18       6       8         2       0.001      -0.003       0.016       0.003
  19       6       8         4      -0.000      -0.002       0.012       0.003
  20      12       6         6      -0.001       0.000      -0.002      -0.002

Note: Standard errors are typically less than 0.003. 
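
The entries in Table A.1 are expressed in standard deviation units. The abstract equates 0.01 to 0.02 standard deviations with about 0.5 to 1.0 NAEP scale points, which implies roughly 50 scale points per standard deviation. The sketch below applies that factor to two Table A.1 entries (Cell 8, -0.15 population change: 0.037 for invariance equating and 0.007 for common item equating). Treating the factor as exactly 50 is an assumption, and the names `NAEP_POINTS_PER_SD` and `to_naep_points` are illustrative.

```python
# Conversion factor implied by the report's statement that 0.01 to 0.02 SD
# corresponds to about 0.5 to 1.0 NAEP scale points (assumed to be exactly 50).
NAEP_POINTS_PER_SD = 50.0


def to_naep_points(bias_sd: float) -> float:
    """Convert a bias in standard deviation units to NAEP scale points."""
    return bias_sd * NAEP_POINTS_PER_SD


# Table A.1, Cell 8, -0.15 population change.
cell8_bias_sd = {"invariance": 0.037, "common_item": 0.007}
cell8_bias_points = {m: to_naep_points(b) for m, b in cell8_bias_sd.items()}

for method, points in cell8_bias_points.items():
    print(f"{method}: {points:.2f} NAEP scale points")
```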
Table A.2 Ratio of Wave 2 to Wave 1 Scale Variances (Reading Simulation) 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common  Invariance Common Item  Invariance Common Item
Cell   Items   Items    Type 1    Equating    Equating    Equating    Equating
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1       0.997       1.036       0.991       1.025
   2       6       6         3       1.003       1.029       0.991       1.030
   3       6       7         1       0.999       1.042       0.986       1.041
   4       6       8         3       0.996       1.027       0.994       1.027
   5      12       6         6       0.992       1.028       0.977       1.029

48 Total Items, 24 Items Taken
   6      12      12         2       1.003       1.055       0.988       1.054
   7      12      12         6       1.000       1.052       0.990       1.051
   8      12      14         2       0.996       1.048       0.984       1.051
   9      12      16         6       0.998       1.055       0.997       1.055
  10      24      12        12       0.992       1.054       0.984       1.055

24 Total Items, 6 Items Taken
  11       3       3         1       0.997       0.986       0.994       0.988
  12       3       3         2       0.996       0.984       0.997       0.985
  13       3       4         1       0.993       0.992       0.994       0.988
  14       3       4         2       1.002       0.992       1.007       0.986
  15       6       3         3       0.996       0.984       1.001       0.974

24 Total Items, 12 Items Taken
  16       6       6         2       0.996       1.015       0.996       1.026
  17       6       6         4       1.001       1.029       1.001       1.023
  18       6       8         2       1.006       1.025       0.996       1.026
  19       6       8         4       0.996       1.034       0.998       1.024
  20      12       6         6       0.996       1.013       0.995       1.021




A Study of Equating in NAEP 



Table A.3 Differences Between Wave 2 and Wave 1 Scale Skewness (Reading Simulation) 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common  Invariance Common Item  Invariance Common Item
Cell   Items   Items    Type 1    Equating    Equating    Equating    Equating
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1       0.009       0.020       0.021      -0.007
   2       6       6         3      -0.001      -0.009      -0.019      -0.011
   3       6       7         1       0.006       0.004       0.002       0.002
   4       6       8         3      -0.004      -0.020      -0.006      -0.002
   5      12       6         6       0.011       0.019       0.008      -0.008

48 Total Items, 24 Items Taken
   6      12      12         2      -0.011      -0.015      -0.018      -0.016
   7      12      12         6       0.006       0.016      -0.014       0.006
   8      12      14         2       0.005       0.000       0.016       0.003
   9      12      16         6       0.003       0.013      -0.006       0.000
  10      24      12        12      -0.017       0.003      -0.018       0.018

24 Total Items, 6 Items Taken
  11       3       3         1      -0.012      -0.021       0.002      -0.009
  12       3       3         2       0.003      -0.013      -0.015      -0.011
  13       3       4         1       0.001      -0.007       0.008      -0.009
  14       3       4         2       0.006      -0.019      -0.018      -0.025
  15       6       3         3      -0.011      -0.025      -0.008      -0.025

24 Total Items, 12 Items Taken
  16       6       6         2       0.003      -0.004       0.013      -0.000
  17       6       6         4      -0.021      -0.007      -0.012      -0.018
  18       6       8         2      -0.003       0.005       0.015      -0.009
  19       6       8         4      -0.002       0.005       0.001      -0.006
  20      12       6         6      -0.016      -0.016      -0.024      -0.020

Note: Standard errors are typically below 0.008. 
Table A.4 Differences Between Wave 2 and Wave 1 Scale Kurtosis (Reading Simulation) 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common  Invariance Common Item  Invariance Common Item
Cell   Items   Items    Type 1    Equating    Equating    Equating    Equating
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1      -0.000      -0.013       0.013      -0.019
   2       6       6         3       0.030       0.038       0.053       0.025
   3       6       7         1      -0.023      -0.036      -0.023      -0.033
   4       6       8         3       0.008      -0.036      -0.012      -0.034
   5      12       6         6       0.013      -0.023       0.016       0.007

48 Total Items, 24 Items Taken
   6      12      12         2      -0.025       0.029       0.011       0.016
   7      12      12         6       0.000       0.008       0.004       0.013
   8      12      14         2       0.001       0.030       0.011       0.037
   9      12      16         6      -0.001       0.037       0.009       0.025
  10      24      12        12       0.017       0.038       0.017       0.023

24 Total Items, 6 Items Taken
  11       3       3         1       0.005       0.007       0.004      -0.006
  12       3       3         2      -0.019      -0.029      -0.014      -0.028
  13       3       4         1      -0.030      -0.025       0.006      -0.032
  14       3       4         2       0.030       0.001       0.043       0.031
  15       6       3         3       0.006      -0.004      -0.002      -0.008

24 Total Items, 12 Items Taken
  16       6       6         2       0.032       0.008       0.017       0.002
  17       6       6         4       0.029      -0.003       0.010      -0.009
  18       6       8         2       0.008      -0.025       0.003       0.000
  19       6       8         4       0.026      -0.007       0.025       0.018
  20      12       6         6      -0.002      -0.009       0.009      -0.040

Note: Standard errors are typically below 0.025. 

Table A.6 24 Total Items: Scale Means (100*Standard Error) for Two Equating Methods (Reading Simulation) 

Table A.13 Moments and Quantiles for Cell 8 (Reading Simulation) with X = 0.7 (100*Standard Error) 

                        No Population Change -0.15 Population Change
              Wave 1  Invariance Common Item  Invariance Common Item
                        Equating    Equating    Equating    Equating

Moments
Mean          -0.005      -0.008      -0.014      -0.116      -0.144
             (0.121)     (0.245)     (0.165)     (0.154)     (0.194)
Variance       1.141       1.146       1.212       1.129       1.202
             (0.278)     (0.403)     (0.212)     (0.306)     (0.222)
Skew          -0.100      -0.100      -0.117      -0.106      -0.105
             (1.116)     (0.849)     (0.639)     (0.523)     (0.749)
Kurtosis      -0.065      -0.075      -0.099      -0.072      -0.115
             (1.458)     (0.877)     (1.170)     (1.650)     (1.028)

Quantiles
1%            -2.541      -2.549      -2.639      -2.648      -2.746
             (1.089)     (0.858)     (0.970)     (1.108)     (1.204)
5%            -1.805      -1.807      -1.872      -1.901      -1.988
             (0.633)     (0.510)     (0.565)     (0.382)     (0.601)
10%           -1.406      -1.411      -1.459      -1.505      -1.576
             (0.501)     (0.428)     (0.406)     (0.413)     (0.326)
25%           -0.723      -0.731      -0.753      -0.829      -0.888
             (0.238)     (0.432)     (0.367)     (0.180)     (0.315)
50%            0.020       0.014       0.016      -0.097      -0.118
             (0.350)     (0.419)     (0.247)     (0.283)     (0.315)
75%            0.731       0.729       0.746       0.617       0.616
             (0.250)     (0.296)     (0.316)     (0.269)     (0.295)
90%            1.350       1.353       1.382       1.233       1.252
             (0.268)     (0.396)     (0.303)     (0.442)     (0.484)
95%            1.713       1.713       1.753       1.597       1.620
             (0.602)     (0.460)     (0.392)     (0.269)     (0.586)
99%            2.387       2.373       2.417       2.246       2.283
             (1.133)     (1.080)     (0.937)     (0.551)     (0.660)
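
One way to read Table A.13 is to compare each Wave 2 quantile with its Wave 1 counterpart and see how far the observed shift departs from the simulated population change. The sketch below does this for the common item equating column under the -0.15 population change. It assumes, as a simplification introduced here, that the generating change is a uniform shift of -0.15 at every quantile, and it reuses the roughly 50 NAEP points per standard deviation implied by the abstract. The variable and function names are illustrative.

```python
QUANTILES = [1, 5, 10, 25, 50, 75, 90, 95, 99]

# Table A.13: Wave 1 column and the common item equating column
# under the -0.15 population change.
wave1 = [-2.541, -1.805, -1.406, -0.723, 0.020, 0.731, 1.350, 1.713, 2.387]
wave2_common_item = [-2.746, -1.988, -1.576, -0.888, -0.118,
                     0.616, 1.252, 1.620, 2.283]

ASSUMED_SHIFT = -0.15   # assumption: the same shift applies at every quantile
POINTS_PER_SD = 50.0    # assumption: factor implied by the abstract


def excess_shift(w1, w2, true_shift=ASSUMED_SHIFT):
    """Observed quantile shift minus the assumed true shift, in SD units."""
    return [round((b - a) - true_shift, 3) for a, b in zip(w1, w2)]


excess = excess_shift(wave1, wave2_common_item)
for q, e in zip(QUANTILES, excess):
    print(f"{q:>2}%: excess shift {e:+.3f} SD "
          f"({e * POINTS_PER_SD:+.2f} NAEP points)")
```

Under these assumptions the departures are largest in the tails and smallest near the median, in line with the report's caution about extreme percentiles. The comparison ignores sampling error, which the table reports in parentheses as 100 times the standard error.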
Table A.14 Average Scale Linking Bias (Mathematics Simulation) 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common  Invariance Common Item             Common Item
Cell   Items   Items    Type 1    Equating    Equating                Equating
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1      -0.003      -0.013                  -0.004
   2       6       6         3       0.001      -0.004                   0.005
   3       6       7         1      -0.004      -0.012                   0.006
   4       6       8         3      -0.007      -0.006                   0.006
   5      12       6         6       0.000      -0.001                  -0.008

48 Total Items, 24 Items Taken
   6      12      12         2      -0.004      -0.016                   0.003
   7      12      12         6      -0.000      -0.010                   0.008
   8      12      14         2       0.000      -0.013                   0.001
   9      12      16         6      -0.001      -0.011                   0.006
  10      24      12        12       0.002      -0.006                  -0.007

24 Total Items, 6 Items Taken
  11       3       3         1       0.001       0.000                  -0.002
  12       3       3         2       0.001       0.003                   0.010
  13       3       4         1       0.002      -0.002                   0.001
  14       3       4         2      -0.004      -0.005                   0.001
  15       6       3         3       0.003       0.007                   0.003

24 Total Items, 12 Items Taken
  16       6       6         2       0.017      -0.009                   0.008
  17       6       6         4       0.031      -0.004                   0.021
  18       6       8         2       0.021      -0.012                  -0.001
  19       6       8         4       0.049      -0.004                   0.019
  20      12       6         6       0.022       0.005                  -0.003

Note: Standard errors are typically less than 0.003. 
Table A.15 Ratio of Wave 2 to Wave 1 Scale Variances (Mathematics Simulation) 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common  Invariance Common Item             Common Item
Cell   Items   Items    Type 1    Equating    Equating                Equating
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1       1.002       1.050                   1.042
   2       6       6         3       0.991       1.037                   1.037
   3       6       7         1       0.998       1.046                   1.038
   4       6       8         3       0.986       1.043                   1.036
   5      12       6         6       0.985       1.040                   1.029

48 Total Items, 24 Items Taken
   6      12      12         2       0.997       1.028                   1.017
   7      12      12         6       0.999       1.026                   1.022
   8      12      14         2       0.999       1.027                   1.017
   9      12      16         6       0.995       1.018                   1.027
  10      24      12        12       0.993       1.034                   1.019

24 Total Items, 6 Items Taken
  11       3       3         1       0.994       1.017                   1.015
  12       3       3         2       1.015       1.027                   1.030
  13       3       4         1       0.977       1.019                   1.003
  14       3       4         2       1.002       1.028                   1.018
  15       6       3         3       1.002       0.975                   0.969

24 Total Items, 12 Items Taken
  16       6       6         2       0.864       1.053                   1.037
  17       6       6         4       0.893       1.058                   1.049
  18       6       8         2       0.799       1.051                   1.051
  19       6       8         4       0.812       1.043                   1.039
  20      12       6         6       0.896       1.046                   1.041
Table A.16 Differences Between Wave 2 and Wave 1 Scale Skewness (Mathematics Simulation) 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common  Invariance Common Item             Common Item
Cell   Items   Items    Type 1    Equating    Equating                Equating
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1       0.001      -0.024                  -0.031
   2       6       6         3      -0.010      -0.011                  -0.023
   3       6       7         1      -0.001      -0.015                  -0.028
   4       6       8         3      -0.002      -0.016                  -0.025
   5      12       6         6      -0.008      -0.036                  -0.030

48 Total Items, 24 Items Taken
   6      12      12         2      -0.014      -0.045                  -0.042
   7      12      12         6      -0.006      -0.028                  -0.041
   8      12      14         2      -0.012      -0.040                  -0.035
   9      12      16         6       0.002      -0.031                  -0.030
  10      24      12        12      -0.007      -0.010                  -0.015

24 Total Items, 6 Items Taken
  11       3       3         1      -0.015       0.004                   0.003
  12       3       3         2       0.001       0.004                  -0.005
  13       3       4         1       0.024       0.022                   0.030
  14       3       4         2       0.001       0.008                   0.002
  15       6       3         3      -0.014      -0.002                   0.013

24 Total Items, 12 Items Taken
  16       6       6         2       0.059      -0.017                  -0.021
  17       6       6         4       0.070      -0.018                  -0.008
  18       6       8         2       0.080      -0.014                  -0.039
  19       6       8         4       0.141      -0.005                  -0.019
  20      12       6         6       0.068      -0.022                  -0.031

Note: Standard errors are typically less than 0.008. 
Table A.17 Differences Between Wave 2 and Wave 1 Scale Kurtosis (Mathematics Simulation) 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common  Invariance Common Item             Common Item
Cell   Items   Items    Type 1    Equating    Equating                Equating
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1       0.011       0.004                  -0.008
   2       6       6         3      -0.018       0.001                  -0.035
   3       6       7         1      -0.014      -0.012                  -0.023
   4       6       8         3      -0.018      -0.039                  -0.022
   5      12       6         6      -0.021      -0.024                  -0.042

48 Total Items, 24 Items Taken
   6      12      12         2       0.022       0.004                  -0.023
   7      12      12         6       0.024       0.024                  -0.009
   8      12      14         2      -0.010      -0.006                  -0.023
   9      12      16         6      -0.026      -0.016                  -0.034
  10      24      12        12      -0.017       0.016                   0.000

24 Total Items, 6 Items Taken
  11       3       3         1      -0.016      -0.028                  -0.071
  12       3       3         2       0.019      -0.045                  -0.046
  13       3       4         1      -0.001      -0.019                  -0.045
  14       3       4         2      -0.012      -0.057                  -0.070
  15       6       3         3      -0.021      -0.085                  -0.069

24 Total Items, 12 Items Taken
  16       6       6         2      -0.103      -0.008                  -0.029
  17       6       6         4      -0.059       0.006                   0.000
  18       6       8         2      -0.150       0.012                  -0.025
  19       6       8         4      -0.117      -0.017                  -0.059
  20      12       6         6      -0.062       0.030                   0.000

Note: Standard errors are typically less than 0.025. 

 0.050   -0.018    0.072    0.082 
(0.770)  (1.698)  (0.822)  (0.640) 

Table A.24 48 Total Items: Scale Kurtosis (100*Standard Error) for Two Equating Methods (Mathematics Simulation) 

(0.539)  (1.448)  (1.491)  (1.383) 
-0.245   -0.219   -0.230   -0.211 
(1.079)  (1.488)  (1.094)  (0.940) 
-0.260   -0.243   -0.276   -0.260 
(0.687)  (0.993)  (0.721)  (1.142) 

(0.707)  (3.613)  (0.820)  (1.239) 
-0.156   -0.094   -0.186   -0.156 
(0.999)  (3.229)  (0.687)  (0.945) 


Table A.26 Average Scale Linking Bias Using Multiple Group IRT 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common     Reading        Math     Reading        Math
Cell   Items   Items    Type 1
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1      -0.003      -0.001      -0.003      -0.002
   2       6       6         3       0.003      -0.003       0.000      -0.007
   3       6       7         1       0.001       0.005      -0.014       0.002
   4       6       8         3       0.001      -0.008       0.001      -0.002
   5      12       6         6      -0.005      -0.004      -0.003       0.006

48 Total Items, 24 Items Taken
   6      12      12         2      -0.007      -0.002       0.001      -0.005
   7      12      12         6      -0.008       0.002      -0.004       0.000
   8      12      14         2      -0.011       0.001      -0.003       0.002
   9      12      16         6       0.005      -0.004      -0.001      -0.010
  10      24      12        12      -0.001       0.001       0.000       0.001

24 Total Items, 6 Items Taken
  11       3       3         1       0.001      -0.004      -0.006      -0.003
  12       3       3         2       0.000       0.003       0.002       0.005
  13       3       4         1      -0.001      -0.009      -0.001      -0.002
  14       3       4         2      -0.001       0.004      -0.011       0.004
  15       6       3         3       0.002      -0.000       0.000

24 Total Items, 12 Items Taken
  16       6       6         2       0.004       0.001       0.003       0.001
  17       6       6         4       0.006      -0.004       0.005      -0.007
  18       6       8         2      -0.006      -0.005      -0.008      -0.002
  19       6       8         4      -0.003       0.012      -0.004       0.005
  20      12       6         6      -0.005       0.004       0.000      -0.001

Note: Standard errors are typically less than 0.003. 
Table A.27 Ratio of Wave 2 to Wave 1 Variance Using Multiple Group IRT 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common     Reading        Math     Reading        Math
Cell   Items   Items    Type 1
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1       1.002       0.988       0.991       0.983
   2       6       6         3       0.996       0.991       0.998       0.985
   3       6       7         1       0.993       0.988       1.001       0.995
   4       6       8         3       0.993       1.000       0.996       0.994
   5      12       6         6       1.001       1.001       1.002       0.992

48 Total Items, 24 Items Taken
   6      12      12         2       0.989       0.985       0.995       0.977
   7      12      12         6       1.009       1.000       1.003       0.979
   8      12      14         2       0.992       0.981       0.990       0.983
   9      12      16         6       0.981       0.991       0.990       0.982
  10      24      12        12       0.995       1.010       0.999       0.996

24 Total Items, 6 Items Taken
  11       3       3         1       1.008       1.005       0.994       0.992
  12       3       3         2       1.008       0.997       1.003       0.988
  13       3       4         1       1.001       0.990       0.998       0.983
  14       3       4         2       0.993       1.002       0.995       0.994
  15       6       3         3       0.994       1.005       0.997

24 Total Items, 12 Items Taken
  16       6       6         2       1.009       1.010       1.012       0.986
  17       6       6         4       0.998       0.993       1.007       0.995
  18       6       8         2       0.990       0.990       0.991       0.977
  19       6       8         4       1.002       0.999       0.995       0.990
  20      12       6         6       1.004       1.006       0.997       0.988
Table A.28 Difference Between Wave 2 and Wave 1 Skewness Using Multiple Group IRT 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common     Reading        Math     Reading        Math
Cell   Items   Items    Type 1
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1       0.009      -0.021      -0.001      -0.008
   2       6       6         3       0.002       0.003       0.001       0.004
   3       6       7         1      -0.010       0.001      -0.008      -0.009
   4       6       8         3       0.002      -0.028      -0.005      -0.011
   5      12       6         6       0.003       0.001      -0.008       0.004

48 Total Items, 24 Items Taken
   6      12      12         2      -0.003      -0.045       0.000      -0.023
   7      12      12         6       0.003      -0.026      -0.010      -0.007
   8      12      14         2      -0.027      -0.068      -0.032      -0.047
   9      12      16         6      -0.001      -0.010      -0.002      -0.023
  10      24      12        12      -0.009      -0.027      -0.021      -0.011

24 Total Items, 6 Items Taken
  11       3       3         1       0.010       0.018      -0.004       0.011
  12       3       3         2      -0.010      -0.014      -0.018       0.010
  13       3       4         1      -0.009       0.011      -0.003       0.003
  14       3       4         2       0.002       0.006      -0.009      -0.002
  15       6       3         3       0.011      -0.003      -0.003

24 Total Items, 12 Items Taken
  16       6       6         2       0.004      -0.004       0.000       0.000
  17       6       6         4      -0.003       0.010      -0.020       0.006
  18       6       8         2      -0.015       0.010       0.004      -0.011
  19       6       8         4      -0.003       0.015      -0.020       0.016
  20      12       6         6      -0.017      -0.003      -0.005       0.000

Note: Standard errors are typically less than 0.008. 
Table A.29 Difference Between Wave 2 and Wave 1 Kurtosis Using Multiple Group IRT 

            Number of:            No Population Change -0.15 Population Change
      Common  Type 1    Common     Reading        Math     Reading        Math
Cell   Items   Items    Type 1
                         Items

48 Total Items, 12 Items Taken
   1       6       6         1       0.007      -0.030       0.016      -0.016
   2       6       6         3       0.000       0.006       0.002      -0.023
   3       6       7         1       0.014      -0.039      -0.020      -0.043
   4       6       8         3      -0.024      -0.020       0.005      -0.011
   5      12       6         6      -0.018      -0.006      -0.020      -0.005

48 Total Items, 24 Items Taken
   6      12      12         2       0.031      -0.012       0.013      -0.039
   7      12      12         6       0.008       0.003      -0.011       0.005
   8      12      14         2       0.044      -0.038       0.018      -0.027
   9      12      16         6      -0.035       0.003       0.001      -0.022
  10      24      12        12       0.004      -0.005       0.019      -0.019

24 Total Items, 6 Items Taken
  11       3       3         1      -0.006       0.012       0.015      -0.013
  12       3       3         2       0.002      -0.019      -0.001      -0.008
  13       3       4         1       0.010       0.010      -0.012      -0.009
  14       3       4         2      -0.005      -0.012      -0.007      -0.025
  15       6       3         3       0.000      -0.009      -0.047

24 Total Items, 12 Items Taken
  16       6       6         2      -0.003      -0.004      -0.011      -0.029
  17       6       6         4       0.002       0.016       0.006       0.003
  18       6       8         2      -0.019      -0.019       0.021      -0.035
  19       6       8         4       0.000      -0.007       0.026      -0.001
  20      12       6         6       0.039      -0.012       0.020      -0.025

Note: Standard errors are typically less than 0.025. 
Section B: Figures 

Figure B.1a. Histogram of the Generating Populations of Abilities: Reading 
[Figure not reproduced. Axis label: Reading Ability.] 

Figure B.1b. Histogram of the Generating Populations of Abilities: Mathematics 
[Figure not reproduced. Axis label: Mathematics Ability.] 

Figure B.2. Quantiles of Scale Distributions for Cell 1 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.3. Quantiles of Scale Distributions for Cell 2 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.4. Quantiles of Scale Distributions for Cell 3 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.5. Quantiles of Scale Distributions for Cell 4 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.6. Quantiles of Scale Distributions for Cell 5 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.7. Quantiles of Scale Distributions for Cell 6 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.8. Quantiles of Scale Distributions for Cell 7 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.9. Quantiles of Scale Distributions for Cell 8 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.10. Quantiles of Scale Distributions for Cell 9 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.11. Quantiles of Scale Distributions for Cell 10 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.12. Quantiles of Scale Distributions for Cell 11 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.13. Quantiles of Distribution for Cell 12 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.14. Quantiles of Scale Distributions for Cell 13 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.15. Quantiles of Scale Distributions for Cell 14 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.16. Quantiles of Scale Distributions for Cell 15 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.17. Quantiles of Scale Distributions for Cell 16 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.18. Quantiles of Scale Distributions for Cell 17 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 

Figure B.19. Quantiles of Scale Distributions for Cell 18 (Reading Simulation) 
[Figure not reproduced. Axis label: Quantiles of Distributions of Scaled Scores.] 




Figure B.20. Quantiles of Scale Distributions for Cell 19 (Reading Simulation) 




Quantiles of Distributions of Scaled Scores 



Figure B.21. Quantiles of Scale Distributions for Cell 20 (Reading Simulation) 




Quantiles of Distributions of Scaled Scores 
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Figure B.22. Quantiles of Scale Distributions for Cell Eight (Reading Simulation), 
with Increased Multidimensionality X = 0.7 




Quantiles of Distributions of Scaled Scores 
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Figure B.23. Quantiles of Scale Distributions for Cell 1 (Mathematics Simulation) 



Math: Cell One 




Quantiles of Distributions of Scaled Scores 



Figure B.24. Quantiles of Scale Distributions for Cell 2 (Mathematics Simulation) 



Math; Cell Two 




Quantiles of Distributions of Scaled Scores 
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Figure B.25. Quantiles of Scale Distributions for Cell 3 (Mathematics Simulation) 



Math; Cell Three 
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Quantiles of Distributions of Scaled Scores 



Figure B.26. Quantiles of Scale Distributions for Cell 4 (Mathematics Simulation) 



Math: Cell Four 




-2.3624 -0.7226 0.7491 2.4961 



Quantiles of Distributions of Scaled Scores 
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Figure B.27. Quantiles of Scale Distributions for Cell 5 (Mathematics Simulation) 



Math: Cell Five 




Quantiles of Distributions of Scaled Scores 



Figure B.28. Quantiles of Scale Distributions for Cell 6 (Mathematics Simulation) 



Math: Cell Six 




Quantiles of Distributions of Scaled Scores

Figure B.29. Quantiles of Scale Distributions for Cell 7 (Mathematics Simulation)



Math: Cell Seven 




Quantiles of Distributions of Scaled Scores 



Figure B.30. Quantiles of Scale Distributions for Cell 8 (Mathematics Simulation) 




Quantiles of Distributions of Scaled Scores

Figure B.31. Quantiles of Scale Distributions for Cell 9 (Mathematics Simulation)



Math: Cell Nine 




Quantiles of Distributions of Scaled Scores 



Figure B.32. Quantiles of Scale Distributions for Cell 10 (Mathematics Simulation)



Math: Cell Ten

Quantiles of Distributions of Scaled Scores

Figure B.33. Quantiles of Scale Distributions for Cell 11 (Mathematics Simulation)



Math: Cell Eleven 




Quantiles of Distributions of Scaled Scores 



Figure B.34. Quantiles of Scale Distributions for Cell 12 (Mathematics Simulation) 



Math: Cell Twelve 




Quantiles of Distributions of Scaled Scores

Figure B.35. Quantiles of Scale Distributions for Cell 13 (Mathematics Simulation)



Math: Cell Thirteen 




Quantiles of Distributions of Scaled Scores 



Figure B.36. Quantiles of Scale Distributions for Cell 14 (Mathematics Simulation)



Math: Cell Fourteen

Quantiles of Distributions of Scaled Scores

Figure B.37. Quantiles of Scale Distributions for Cell 15 (Mathematics Simulation)



Math: Cell Fifteen 




Quantiles of Distributions of Scaled Scores 



Figure B.38. Quantiles of Scale Distributions for Cell 16 (Mathematics Simulation) 



Math: Cell Sixteen

Quantiles of Distributions of Scaled Scores

Figure B.39. Quantiles of Scale Distributions for Cell 17 (Mathematics Simulation)



Math: Cell Seventeen 




Quantiles of Distributions of Scaled Scores 



Figure B.40. Quantiles of Scale Distributions for Cell 18 (Mathematics Simulation) 




Quantiles of Distributions of Scaled Scores

Figure B.41. Quantiles of Scale Distributions for Cell 19 (Mathematics Simulation)



Math: Cell Nineteen 




Quantiles of Distributions of Scaled Scores 



Figure B.42. Quantiles of Scale Distributions for Cell 20 (Mathematics Simulation) 




Quantiles of Distributions of Scaled Scores 